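Linux Kernel Study: The Block Layer
This article is synced from (follow the link if this page does not display properly): https://zohead.com/archives/linux-kernel-learning-block-layer/
The block I/O layer is another very important concept in the Linux kernel. It is considerably more complex than the character device implementation, and block devices show up almost everywhere in today's systems. The sections below introduce the kernel block I/O layer piece by piece. You should already be clear on the difference between block devices and character devices and have some basic kernel background.
1. The buffer_head concept:
buffer_head is a common data structure in the block layer (though these days it plays a much smaller role than structures such as bio, described below, HOHO).
When a block of a block device (typically an integer multiple of the sector size and no larger than one memory page) is brought into memory by a read, a write, and so on, it is said to live in a buffer. Each buffer is associated with exactly one block and is the in-memory representation of that disk block. The kernel needs control information to go with the block data, so each buffer has a descriptor called the buffer head, represented by struct buffer_head and defined in <linux/buffer_head.h>.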
enum bh_state_bits {
    BH_Uptodate, /* Contains valid data */
    BH_Dirty, /* Is dirty */
    BH_Lock, /* Is locked */
    BH_Req, /* Has been submitted for I/O */
    BH_Uptodate_Lock,/* Used by the first bh in a page, to serialise
                      * IO completion of other buffers in the page
                      */
    BH_Mapped, /* Has a disk mapping */
    BH_New, /* Disk mapping was newly created by get_block */
    BH_Async_Read, /* Is under end_buffer_async_read I/O */
    BH_Async_Write, /* Is under end_buffer_async_write I/O */
    BH_Delay, /* Buffer is not yet allocated on disk */
    BH_Boundary, /* Block is followed by a discontiguity */
    BH_Write_EIO, /* I/O error on write */
    BH_Ordered, /* ordered write */
    BH_Eopnotsupp, /* operation not supported (barrier) */
    BH_Unwritten, /* Buffer is allocated on disk but not written */
    BH_Quiet, /* Buffer Error Prinks to be quiet */
    BH_PrivateStart,/* not a state bit, but the first bit available
                     * for private allocation by other entities
                     */
};

struct buffer_head {
    unsigned long b_state; /* buffer state bitmap (see above) */
    struct buffer_head *b_this_page;/* circular list of page's buffers */
    struct page *b_page; /* the page this bh is mapped to */
    sector_t b_blocknr; /* start block number */
    size_t b_size; /* size of mapping */
    char *b_data; /* pointer to data within the page */
    struct block_device *b_bdev;
    bh_end_io_t *b_end_io; /* I/O completion */
    void *b_private; /* reserved for b_end_io */
    struct list_head b_assoc_buffers; /* associated with another mapping */
    struct address_space *b_assoc_map; /* mapping this buffer is
                                          associated with */
    atomic_t b_count; /* users using this buffer_head */
};
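The b_state field holds the buffer's state. It is a bitmap, and each bit position is one of the values of the bh_state_bits enum shown in the same listing (the comments there explain each state), so several states can be set at once. b_count is the buffer's reference count; the inline functions get_bh and put_bh increment and decrement it atomically. Call get_bh before working on a buffer_head and put_bh when you are done. b_bdev is the associated block device (the block_device structure deserves its own discussion), and b_blocknr is the number of the block on b_bdev that the buffer corresponds to. b_page points to the memory page the buffer is mapped to. b_data points at the block's data inside b_page, and that data is b_size bytes long.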
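To make the b_state bitmap and the b_count reference count a bit more concrete, here is a minimal userspace sketch in C. It is not kernel code: struct demo_bh and all the demo_* helpers are names I made up for illustration, and the bit helpers here are plain non-atomic operations, whereas the kernel manipulates b_state with atomic bit operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers, mirroring the first few entries of bh_state_bits. */
enum demo_state_bits { DEMO_Uptodate, DEMO_Dirty, DEMO_Lock, DEMO_NR_BITS };

/* Hypothetical, trimmed-down stand-in for struct buffer_head. */
struct demo_bh {
    unsigned long b_state; /* bitmap indexed by demo_state_bits */
    atomic_int b_count;    /* reference count */
};

/* The enum value is a bit position, not a mask, so shift before use. */
static void demo_set_bit(struct demo_bh *bh, int nr)
{
    assert(nr >= 0 && nr < DEMO_NR_BITS);
    bh->b_state |= 1UL << nr;
}

static void demo_clear_bit(struct demo_bh *bh, int nr)
{
    assert(nr >= 0 && nr < DEMO_NR_BITS);
    bh->b_state &= ~(1UL << nr);
}

static bool demo_test_bit(const struct demo_bh *bh, int nr)
{
    return (bh->b_state >> nr) & 1UL;
}

/* Take and drop a reference, in the spirit of get_bh and put_bh. */
static void demo_get_bh(struct demo_bh *bh)
{
    atomic_fetch_add(&bh->b_count, 1);
}

static void demo_put_bh(struct demo_bh *bh)
{
    atomic_fetch_sub(&bh->b_count, 1);
}
```

The point to take away is that a buffer can be, say, up to date and dirty at the same time, because each state occupies its own bit.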
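Before Linux 2.6, buffer_head was a far more important data structure: it was the unit of I/O in the kernel (a role now played by the bio structure). It described the mapping from a disk block to a physical page, and every block I/O operation was carried by a buffer_head as well. This caused several significant problems. First, the buffer_head structure was large (it has since been slimmed down a lot) and manipulating I/O data in terms of buffer heads was complicated; the kernel prefers to work in terms of pages, which also performs better. Second, a large buffer_head was often used to describe a single buffer that could well be smaller than a page, which is inefficient. Third, a buffer_head can describe only one buffer, so a large I/O operation was often split across many buffer_head structures, adding space overhead. From 2.6 on (it was in fact introduced in the 2.5 development kernels), the kernel therefore uses the bio structure, which works directly with pages and address spaces rather than buffers.
2. bio:
Enough bad-mouthing of buffer_head; now let's look at its replacement, bio. It aims to be a lightweight representation of an I/O request and is defined in <linux/bio.h>.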
struct bio {
    sector_t bi_sector; /* device address in 512 byte
                           sectors */
    struct bio *bi_next; /* request queue link */
    struct block_device *bi_bdev;
    unsigned long bi_flags; /* status, command, etc */
    unsigned long bi_rw; /* bottom bits READ/WRITE,
                          * top bits priority
                          */
    unsigned short bi_vcnt; /* how many bio_vec's */
    unsigned short bi_idx; /* current index into bvl_vec */
    /* Number of segments in this BIO after
     * physical address coalescing is performed.
     */
    unsigned int bi_phys_segments;
    unsigned int bi_size; /* residual I/O count */
    /*
     * To keep track of the max segment size, we account for the
     * sizes of the first and last mergeable segments in this bio.
     */
    unsigned int bi_seg_front_size;
    unsigned int bi_seg_back_size;
    unsigned int bi_max_vecs; /* max bvl_vecs we can hold */
    unsigned int bi_comp_cpu; /* completion CPU */
    atomic_t bi_cnt; /* pin count */
    struct bio_vec *bi_io_vec; /* the actual vec list */
    bio_end_io_t *bi_end_io;
    void *bi_private;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
    struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif
    bio_destructor_t *bi_destructor; /* destructor */
    /*
     * We can inline a number of vecs at the end of the bio, to avoid
     * double allocations for a small number of bio_vecs. This member
     * MUST obviously be kept at the very end of the bio.
     */
    struct bio_vec bi_inline_vecs[0];
};

struct bio_vec {
    struct page *bv_page;
    unsigned int bv_len;
    unsigned int bv_offset;
};
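The definition is already well commented. bi_sector is the sector address in units of 512 bytes (even if the physical device's sector size is not 512 bytes, bi_sector still counts in 512-byte units). bi_bdev is the associated block device. bi_rw says whether this is a read or a write request. bi_cnt is the reference count; the bio_get macro increments it and the bio_put function decrements it. When bi_cnt drops to 0, the bio structure is destroyed and its backing memory is freed.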
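Since bi_sector always counts in 512-byte units, code that thinks in larger blocks has to convert. The following is a small userspace sketch of that arithmetic, not kernel code; demo_sector_t, DEMO_SECTOR_SHIFT and both helper functions are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_sector_t; /* stand-in for the kernel's sector_t */

#define DEMO_SECTOR_SHIFT 9 /* 2^9 = 512 bytes per sector unit */

/*
 * Convert a block number, counted in units of block_size bytes,
 * into a start address counted in 512-byte sectors.
 */
static demo_sector_t demo_block_to_sector(uint64_t blocknr,
                                          unsigned int block_size)
{
    /* Block sizes are assumed to be non-zero multiples of 512. */
    assert(block_size != 0 && block_size % 512 == 0);
    return blocknr * (block_size >> DEMO_SECTOR_SHIFT);
}

/* Number of 512-byte sectors covered by a byte count such as bi_size. */
static unsigned int demo_bytes_to_sectors(unsigned int bytes)
{
    return bytes >> DEMO_SECTOR_SHIFT;
}
```

For example, with a 4096-byte block size, block 10 starts at sector 80, and a 4096-byte transfer covers 8 sectors.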
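I/O vectors:
The most important members of bio are bi_vcnt, bi_idx, and bi_io_vec. bi_io_vec points to an array of bio_vec structures, each describing one segment of this block I/O operation (this will feel familiar if you have used readv and writev), and bi_vcnt is the number of entries in that array. bi_idx is the current index into the bi_io_vec array and is updated as the block I/O operation proceeds. The kernel provides the bio_for_each_segment macro to iterate over a bio's bio_vec entries. The MD software RAID driver in the kernel also uses bi_idx when it distributes a single bio across several disks.
bio_vec is defined in the same listing, also in <linux/bio.h>. Each bio_vec refers to one page: bv_page is the page that holds the data, bv_offset is the offset of the data within that page, and bv_len is the length of the data.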
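Here is a rough userspace model of how bi_idx, bi_vcnt and bi_io_vec fit together. The demo_* types are cut-down stand-ins I invented for this sketch, and demo_for_each_segment only imitates the idea of bio_for_each_segment (walk from bi_idx up to bi_vcnt); it is not the kernel macro.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec. */
struct demo_bio_vec {
    void *bv_page;          /* stand-in for struct page * */
    unsigned int bv_len;    /* bytes in this segment */
    unsigned int bv_offset; /* offset of the data within the page */
};

/* Hypothetical stand-in for the vector-related part of struct bio. */
struct demo_bio {
    unsigned short bi_vcnt;         /* number of entries in bi_io_vec */
    unsigned short bi_idx;          /* index of the current entry */
    struct demo_bio_vec *bi_io_vec; /* the segment array */
};

/* Visit every segment that has not been processed yet. */
#define demo_for_each_segment(bvl, bio, i)                      \
    for ((i) = (bio)->bi_idx, (bvl) = &(bio)->bi_io_vec[(i)];   \
         (i) < (bio)->bi_vcnt;                                  \
         (i)++, (bvl)++)

/* Sum the lengths of the segments from bi_idx to the end. */
static unsigned int demo_remaining_bytes(struct demo_bio *bio)
{
    struct demo_bio_vec *bvl;
    unsigned int total = 0;
    int i;

    assert(bio->bi_idx <= bio->bi_vcnt);
    demo_for_each_segment(bvl, bio, i)
        total += bvl->bv_len;
    return total;
}
```

Advancing bi_idx shrinks the part of the array that is still visited, which is how a bio can record progress without touching the bio_vec entries themselves.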
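buffer_head and bio summary:
As the above shows, block I/O requests are submitted and processed in the form of I/O vectors.
The advantages of bio over buffer_head are: bio can use high memory more easily, because it deals only with pages and never with direct addresses. bio can represent direct I/O (which bypasses the page cache; more on that later). bio has better support for vectored I/O (including sg I/O), which keeps I/O from being broken up. buffer_head is still needed, though: it maps disk blocks to memory, because bio does not carry the buffer state and other information the kernel needs.
3. Request queues:
A block device keeps its pending block I/O requests in a request queue, represented by the request_queue structure defined in <linux/blkdev.h>. The same header also contains the very important request structure: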
struct request {
    struct list_head queuelist;
    struct call_single_data csd;
    struct request_queue *q;
    unsigned int cmd_flags;
    enum rq_cmd_type_bits cmd_type;
    unsigned long atomic_flags;
    int cpu;
    /* the following two fields are internal, NEVER access directly */
    unsigned int __data_len; /* total data len */
    sector_t __sector; /* sector cursor */
    struct bio *bio;
    struct bio *biotail;
    struct hlist_node hash; /* merge hash */
    /*
     * The rb_node is only used inside the io scheduler, requests
     * are pruned when moved to the dispatch queue. So let the
     * completion_data share space with the rb_node.
     */
    union {
        struct rb_node rb_node; /* sort/lookup */
        void *completion_data;
    };
    /*
     * two pointers are available for the IO schedulers, if they need
     * more they have to dynamically allocate it.
     */
    void *elevator_private;
    void *elevator_private2;
    struct gendisk *rq_disk;
    unsigned long start_time;
    /* Number of scatter-gather DMA addr+len pairs after
     * physical address coalescing is performed.
     */
    unsigned short nr_phys_segments;
    unsigned short ioprio;
    int ref_count;
    void *special; /* opaque pointer available for LLD use */
    char *buffer; /* kaddr of the current segment if available */
    int tag;
    int errors;
    /*
     * when request is used as a packet command carrier
     */
    unsigned char __cmd[BLK_MAX_CDB];
    unsigned char *cmd;
    unsigned short cmd_len;
    unsigned int extra_len; /* length of alignment and padding */
    unsigned int sense_len;
    unsigned int resid_len; /* residual count */
    void *sense;
    unsigned long deadline;
    struct list_head timeout_list;
    unsigned int timeout;
    int retries;
    /*
     * completion callback.
     */
    rq_end_io_fn *end_io;
    void *end_io_data;
    /* for bidi */
    struct request *next_rq;
};

struct request_queue
{
    /*
     * Together with queue_head for cacheline sharing
     */
    struct list_head queue_head;
    struct request *last_merge;
    struct elevator_queue *elevator;
    /*
     * the queue request freelist, one for reads and one for writes
     */
    struct request_list rq;
    request_fn_proc *request_fn;
    make_request_fn *make_request_fn;
    prep_rq_fn *prep_rq_fn;
    unplug_fn *unplug_fn;
    merge_bvec_fn *merge_bvec_fn;
    prepare_flush_fn *prepare_flush_fn;
    softirq_done_fn *softirq_done_fn;
    rq_timed_out_fn *rq_timed_out_fn;
    dma_drain_needed_fn *dma_drain_needed;
    lld_busy_fn *lld_busy_fn;
    /*
     * Dispatch queue sorting
     */
    sector_t end_sector;
    struct request *boundary_rq;
    /*
     * Auto-unplugging state
     */
    struct timer_list unplug_timer;
    int unplug_thresh; /* After this many requests */
    unsigned long unplug_delay; /* After this many jiffies */
    struct work_struct unplug_work;
    struct backing_dev_info backing_dev_info;
    /*
     * The queue owner gets to use this for whatever they like.
     * ll_rw_blk doesn't touch it.
     */
    void *queuedata;
    /*
     * queue needs bounce pages for pages above this limit
     */
    gfp_t bounce_gfp;
    /*
     * various queue flags, see QUEUE_* below
     */
    unsigned long queue_flags;
    /*
     * protects queue structures from reentrancy. ->__queue_lock should
     * _never_ be used directly, it is queue private. always use
     * ->queue_lock.
     */
    spinlock_t __queue_lock;
    spinlock_t *queue_lock;
    /*
     * queue kobject
     */
    struct kobject kobj;
    /*
     * queue settings
     */
    unsigned long nr_requests; /* Max # of requests */
    unsigned int nr_congestion_on;
    unsigned int nr_congestion_off;
    unsigned int nr_batching;
    void *dma_drain_buffer;
    unsigned int dma_drain_size;
    unsigned int dma_pad_mask;
    unsigned int dma_alignment;
    struct blk_queue_tag *queue_tags;
    struct list_head tag_busy_list;
    unsigned int nr_sorted;
    unsigned int in_flight[2];
    unsigned int rq_timeout;
    struct timer_list timeout;
    struct list_head timeout_list;
    struct queue_limits limits;
    /*
     * sg stuff
     */
    unsigned int sg_timeout;
    unsigned int sg_reserved_size;
    int node;
#ifdef CONFIG_BLK_DEV_IO_TRACE
    struct blk_trace *blk_trace;
#endif
    /*
     * reserved for flush operations
     */
    unsigned int ordered, next_ordered, ordseq;
    int orderr, ordcolor;
    struct request pre_flush_rq, bar_rq, post_flush_rq;
    struct request *orig_bar_rq;
    struct mutex sysfs_lock;
#if defined(CONFIG_BLK_DEV_BSG)
    struct bsg_class_device bsg_dev;
#endif
};
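Many members of request_queue are closely tied to the I/O scheduler, request, and bio. The queue_head member is the doubly linked list of requests, and nr_requests is the maximum number of requests. Higher-level code such as filesystems adds I/O requests to the queue (by way of the I/O scheduler, described below). As long as the queue is not empty, the block device driver has to take requests off the queue and submit them to the corresponding block device. Each entry on the queue is an individual request, represented by the request structure.
Each request can in turn consist of multiple bios; a request holds its bios in order (the request covers consecutive disk blocks).
In practice, an I/O scheduling algorithm only pays off once the request queue holds a certain number of requests; otherwise, in the extreme case, it degenerates into "first come, first served", which is rather sad. Plugging a request_queue is like suspending it, and unplugging is like resuming it. When there are few requests the request_queue is plugged, and it is unplugged only once the number of requests reaches a threshold or the "oldest" request in the queue has already waited for a while. See the unplug_fn, unplug_timer, unplug_thresh, and unplug_delay members of request_queue.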
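The plug and unplug behaviour can be modelled in a few lines of userspace C. This is only a sketch of the idea under simplifying assumptions: struct demo_queue and the demo_* functions are hypothetical, time is an abstract tick counter instead of jiffies and a real timer, and "dispatching" just moves a counter.

```c
#include <assert.h>

/* Hypothetical model of the plug-related state of a request queue. */
struct demo_queue {
    int plugged;                /* 1 while the queue is held back */
    unsigned int nr_queued;     /* requests waiting in the queue */
    unsigned int unplug_thresh; /* unplug after this many requests */
    unsigned long plug_time;    /* tick at which the queue was plugged */
    unsigned long unplug_delay; /* unplug after this many ticks */
    unsigned int dispatched;    /* requests handed to the "driver" */
};

/* Release the queue: everything queued so far is dispatched. */
static void demo_unplug(struct demo_queue *q)
{
    q->plugged = 0;
    q->dispatched += q->nr_queued;
    q->nr_queued = 0;
}

/* Queue one request at tick `now`, plugging an empty queue first. */
static void demo_add_request(struct demo_queue *q, unsigned long now)
{
    assert(q->unplug_thresh > 0);
    if (q->nr_queued == 0 && !q->plugged) {
        q->plugged = 1;
        q->plug_time = now;
    }
    q->nr_queued++;
    if (q->nr_queued >= q->unplug_thresh)
        demo_unplug(q); /* enough requests have piled up */
}

/* Stand-in for the unplug timer firing at tick `now`. */
static void demo_timer_tick(struct demo_queue *q, unsigned long now)
{
    if (q->plugged && now - q->plug_time >= q->unplug_delay)
        demo_unplug(q); /* the oldest request has waited long enough */
}
```

With a threshold of 4 and a delay of 3 ticks, the fourth queued request releases the queue immediately, while a lone request is released by the timer three ticks after it plugged the queue.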
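4. I/O schedulers:
The I/O scheduler is another big part of the block layer and has a very important job. Seeks on today's mechanical hard disks are very slow (often in the millisecond range), so minimizing seeks is the key to performance. In general, an I/O scheduler tries to complete the outstanding requests while moving the disk head as little as possible, which improves read and write efficiency. The best-known approach is the "elevator algorithm".
Because of the I/O scheduler, the kernel does not send requests to the underlying device in the order it received them. Requests are first merged (which reduces the number of requests and seeks; a request that cannot be merged is placed at the tail of the queue) and sorted (which cuts back-and-forth seeking much like an elevator does, and is why I/O schedulers are called elevators). The I/O scheduler manages the block device's request queue: it decides the order of the requests in the queue and when each request is dispatched to the block device.
The Linux kernel already has several good I/O schedulers. Common ones include the Linus Elevator (the scheduler in 2.4), cfq (the default in many distributions), deadline, noop, and anticipatory (a refinement of deadline).
The Linus scheduler does both merging and sorting. It supports front merging (the new request goes directly in front of an existing one) as well as back merging (the new request goes directly behind an existing one, which is of course much more common than front merging), and it has some handling of request age.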
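Front and back merging come down to checking whether two sector ranges touch. Below is a simplified userspace sketch of that check; struct demo_extent, enum demo_merge and the two functions are hypothetical names, and real schedulers apply further conditions (same direction, size limits and so on) that are left out here.

```c
#include <assert.h>
#include <stdint.h>

enum demo_merge { DEMO_NO_MERGE, DEMO_FRONT_MERGE, DEMO_BACK_MERGE };

/* A contiguous range on disk, counted in 512-byte sectors. */
struct demo_extent {
    uint64_t sector;     /* first sector of the range */
    uint32_t nr_sectors; /* length of the range */
};

/* Decide whether `bio` is directly adjacent to the existing `req`. */
static enum demo_merge demo_try_merge(const struct demo_extent *req,
                                      const struct demo_extent *bio)
{
    if (req->sector + req->nr_sectors == bio->sector)
        return DEMO_BACK_MERGE;  /* bio starts where req ends */
    if (bio->sector + bio->nr_sectors == req->sector)
        return DEMO_FRONT_MERGE; /* bio ends where req starts */
    return DEMO_NO_MERGE;
}

/* Grow `req` so that it also covers `bio`. */
static void demo_do_merge(struct demo_extent *req,
                          const struct demo_extent *bio,
                          enum demo_merge how)
{
    assert(how != DEMO_NO_MERGE);
    if (how == DEMO_FRONT_MERGE)
        req->sector = bio->sector; /* the request now starts earlier */
    req->nr_sectors += bio->nr_sectors;
}
```

A request covering sectors 100 to 107 can absorb a new 8-sector I/O at sector 108 as a back merge or one at sector 92 as a front merge; an I/O at sector 200 cannot be merged and needs a request of its own.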
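The deadline scheduler mainly addresses the request starvation (requests not being serviced in time) that the Linus scheduler can cause; it guarantees when service of a request starts. deadline also solves the problem of write requests (usually asynchronous) keeping read requests (usually synchronous) from being serviced promptly, that is, it fixes read latency.
The noop scheduler leaves the original request order almost unchanged (it still merges), while cfq provides something like completely fair scheduling.
In short, different I/O schedulers are usually optimized for particular kinds of requests. I plan to write separate articles on how these schedulers are implemented, so I will not go into detail here.
Some of the I/O scheduling data structures are declared in <linux/elevator.h>; the important ones include elevator_ops, elevator_type, and elevator_queue. elevator_ops defines the operation function interfaces of an I/O scheduling algorithm.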
struct elevator_ops
{
    elevator_merge_fn *elevator_merge_fn;
    elevator_merged_fn *elevator_merged_fn;
    elevator_merge_req_fn *elevator_merge_req_fn;
    elevator_allow_merge_fn *elevator_allow_merge_fn;
    elevator_dispatch_fn *elevator_dispatch_fn;
    elevator_add_req_fn *elevator_add_req_fn;
    elevator_activate_req_fn *elevator_activate_req_fn;
    elevator_deactivate_req_fn *elevator_deactivate_req_fn;
    elevator_queue_empty_fn *elevator_queue_empty_fn;
    elevator_completed_req_fn *elevator_completed_req_fn;
    elevator_request_list_fn *elevator_former_req_fn;
    elevator_request_list_fn *elevator_latter_req_fn;
    elevator_set_req_fn *elevator_set_req_fn;
    elevator_put_req_fn *elevator_put_req_fn;
    elevator_may_queue_fn *elevator_may_queue_fn;
    elevator_init_fn *elevator_init_fn;
    elevator_exit_fn *elevator_exit_fn;
    void (*trim)(struct io_context *);
};

struct elevator_type
{
    struct list_head list;
    struct elevator_ops ops;
    struct elv_fs_entry *elevator_attrs;
    char elevator_name[ELV_NAME_MAX];
    struct module *elevator_owner;
};

struct elevator_queue
{
    struct elevator_ops *ops;
    void *elevator_data;
    struct kobject kobj;
    struct elevator_type *elevator_type;
    struct mutex sysfs_lock;
    struct hlist_head *hash;
};
elevator_type describes a particular I/O scheduler, and you can spot elevator_queue in the declaration of request_queue.
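The pattern behind elevator_ops, elevator_type and elevator_queue is a table of function pointers plus a pointer to scheduler-private data. Here is a tiny userspace sketch of that pattern with a FIFO "scheduler" in the spirit of noop. Every name in it (demo_elevator_ops, demo_fifo, "demo-fifo" and so on) is hypothetical, and the two operations are a drastic simplification of the real interface.

```c
#include <assert.h>
#include <stddef.h>

struct demo_request {
    unsigned long sector;
    struct demo_request *next;
};

/* Operations a scheduler must provide (heavily reduced). */
struct demo_elevator_ops {
    void (*add_req)(void *data, struct demo_request *rq);
    struct demo_request *(*dispatch)(void *data);
};

/* Describes one scheduler implementation. */
struct demo_elevator_type {
    const char *name;
    struct demo_elevator_ops ops;
};

/* Per-queue instance: which ops to call and on which private data. */
struct demo_elevator_queue {
    const struct demo_elevator_ops *ops;
    void *elevator_data;
};

/* Private data of the FIFO scheduler: a simple linked list. */
struct demo_fifo {
    struct demo_request *head, *tail;
};

static void fifo_add(void *data, struct demo_request *rq)
{
    struct demo_fifo *f = data;

    assert(rq != NULL);
    rq->next = NULL;
    if (f->tail)
        f->tail->next = rq;
    else
        f->head = rq;
    f->tail = rq;
}

static struct demo_request *fifo_dispatch(void *data)
{
    struct demo_fifo *f = data;
    struct demo_request *rq = f->head;

    if (rq) {
        f->head = rq->next;
        if (!f->head)
            f->tail = NULL;
        rq->next = NULL;
    }
    return rq; /* NULL when nothing is queued */
}

static const struct demo_elevator_type demo_fifo_type = {
    .name = "demo-fifo",
    .ops = { .add_req = fifo_add, .dispatch = fifo_dispatch },
};
```

The block layer side only ever calls through ops with elevator_data as the argument, so swapping schedulers means pointing the queue at a different type and private data.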
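5. Block device request handling:
To issue a block device read or write, the kernel first builds a bio structure for it (bio being the unit of I/O, after all) that contains the address, length, device, completion callback, and so on. The kernel then hands the request to the block device with the submit_bio function. Here is the implementation of submit_bio (in block/blk-core.c; the next few very important functions also live in this key block layer file):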
void submit_bio(int rw, struct bio *bio)
{
    int count = bio_sectors(bio);
    bio->bi_rw |= rw;
    /*
     * If it's a regular read/write or a barrier with data attached,
     * go through the normal accounting stuff before submission.
     */
    if (bio_has_data(bio)) {
        if (rw & WRITE) {
            count_vm_events(PGPGOUT, count);
        } else {
            task_io_account_read(bio->bi_size);
            count_vm_events(PGPGIN, count);
        }
        if (unlikely(block_dump)) {
            char b[BDEVNAME_SIZE];
            printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
                   current->comm, task_pid_nr(current),
                   (rw & WRITE) ? "WRITE" : "READ",
                   (unsigned long long)bio->bi_sector,
                   bdevname(bio->bi_bdev, b));
        }
    }
    generic_make_request(bio);
}
submit_bio takes the bio structure (together with the rw flags) as input and ends by calling generic_make_request, which keeps forwarding the bio:
static inline void __generic_make_request(struct bio *bio)
{
    struct request_queue *q;
    sector_t old_sector;
    int ret, nr_sectors = bio_sectors(bio);
    dev_t old_dev;
    int err = -EIO;
    might_sleep();
    if (bio_check_eod(bio, nr_sectors))
        goto end_io;
    /*
     * Resolve the mapping until finished. (drivers are
     * still free to implement/resolve their own stacking
     * by explicitly returning 0)
     *
     * NOTE: we don't repeat the blk_size check for each new device.
     * Stacking drivers are expected to know what they are doing.
     */
    old_sector = -1;
    old_dev = 0;
    do {
        char b[BDEVNAME_SIZE];
        q = bdev_get_queue(bio->bi_bdev);
        if (unlikely(!q)) {
            printk(KERN_ERR
                   "generic_make_request: Trying to access "
                   "nonexistent block-device %s (%Lu)\n",
                   bdevname(bio->bi_bdev, b),
                   (long long) bio->bi_sector);
            goto end_io;
        }
        if (unlikely(!bio_rw_flagged(bio, BIO_RW_DISCARD) &&
                     nr_sectors > queue_max_hw_sectors(q))) {
            printk(KERN_ERR "bio too big device %s (%u > %u)\n",
                   bdevname(bio->bi_bdev, b),
                   bio_sectors(bio),
                   queue_max_hw_sectors(q));
            goto end_io;
        }
        if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
            goto end_io;
        if (should_fail_request(bio))
            goto end_io;
        /*
         * If this device has partitions, remap block n
         * of partition p to block n+start(p) of the disk.
         */
        blk_partition_remap(bio);
        if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
            goto end_io;
        if (old_sector != -1)
            trace_block_remap(q, bio, old_dev, old_sector);
        old_sector = bio->bi_sector;
        old_dev = bio->bi_bdev->bd_dev;
        if (bio_check_eod(bio, nr_sectors))
            goto end_io;
        if (bio_rw_flagged(bio, BIO_RW_DISCARD) &&
            !blk_queue_discard(q)) {
            err = -EOPNOTSUPP;
            goto end_io;
        }
        trace_block_bio_queue(q, bio);
        ret = q->make_request_fn(q, bio);
    } while (ret);
    return;
end_io:
    bio_endio(bio, err);
}

void generic_make_request(struct bio *bio)
{
    struct bio_list bio_list_on_stack;
    if (current->bio_list) {
        /* make_request is active */
        bio_list_add(current->bio_list, bio);
        return;
    }
    /* following loop may be a bit non-obvious, and so deserves some
     * explanation.
     * Before entering the loop, bio->bi_next is NULL (as all callers
     * ensure that) so we have a list with a single bio.
     * We pretend that we have just taken it off a longer list, so
     * we assign bio_list to a pointer to the bio_list_on_stack,
     * thus initialising the bio_list of new bios to be
     * added. __generic_make_request may indeed add some more bios
     * through a recursive call to generic_make_request. If it
     * did, we find a non-NULL value in bio_list and re-enter the loop
     * from the top. In this case we really did just take the bio
     * of the top of the list (no pretending) and so remove it from
     * bio_list, and call into __generic_make_request again.
     *
     * The loop was structured like this to make only one call to
     * __generic_make_request (which is important as it is large and
     * inlined) and to keep the structure simple.
     */
    BUG_ON(bio->bi_next);
    bio_list_init(&bio_list_on_stack);
    current->bio_list = &bio_list_on_stack;
    do {
        __generic_make_request(bio);
        bio = bio_list_pop(current->bio_list);
    } while (bio);
    current->bio_list = NULL; /* deactivate */
}
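generic_make_request looks up the request queue of the block device the bio points to and, looping through __generic_make_request, calls that queue's make_request_fn method (see the declaration of request_queue, which defines a whole series of function pointers) to pass the bio down.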
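The trick in generic_make_request, turning recursive submissions from stacked drivers into a loop by parking bios on a per-task list, is easier to see in a stripped-down userspace model. Everything here is hypothetical (the demo_* names, the global demo_current_list standing in for current->bio_list, the depth counters), and the model is single-threaded and covers only the case where a stacking driver resubmits the bio itself.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bio;

struct demo_dev {
    /* Handle the bio; a stacking driver resubmits it to `lower`. */
    void (*make_request)(struct demo_dev *dev, struct demo_bio *bio);
    struct demo_dev *lower; /* next device down the stack, or NULL */
    unsigned long offset;   /* sectors added when remapping */
};

struct demo_bio {
    struct demo_dev *dev;
    unsigned long sector;
    struct demo_bio *next;
};

struct demo_bio_list { struct demo_bio *head, *tail; };

static struct demo_bio_list *demo_current_list; /* like current->bio_list */
static int demo_depth, demo_max_depth;          /* make_request nesting */
static unsigned long demo_last_sector;          /* seen by the bottom device */

static void demo_list_add(struct demo_bio_list *l, struct demo_bio *bio)
{
    bio->next = NULL;
    if (l->tail)
        l->tail->next = bio;
    else
        l->head = bio;
    l->tail = bio;
}

static struct demo_bio *demo_list_pop(struct demo_bio_list *l)
{
    struct demo_bio *bio = l->head;

    if (bio) {
        l->head = bio->next;
        if (!l->head)
            l->tail = NULL;
        bio->next = NULL;
    }
    return bio;
}

void demo_generic_make_request(struct demo_bio *bio)
{
    struct demo_bio_list on_stack = { NULL, NULL };

    if (demo_current_list) { /* a make_request is already active */
        demo_list_add(demo_current_list, bio);
        return;
    }
    assert(bio->next == NULL);
    demo_current_list = &on_stack;
    do {
        bio->dev->make_request(bio->dev, bio);
        bio = demo_list_pop(demo_current_list);
    } while (bio);
    demo_current_list = NULL; /* deactivate */
}

/* Stacking driver: remap to the lower device and resubmit. */
static void demo_stacked_make_request(struct demo_dev *dev, struct demo_bio *bio)
{
    if (++demo_depth > demo_max_depth)
        demo_max_depth = demo_depth;
    bio->dev = dev->lower;
    bio->sector += dev->offset;
    demo_generic_make_request(bio); /* queued on the list, not recursed */
    demo_depth--;
}

/* Bottom device: just record where the bio ended up. */
static void demo_disk_make_request(struct demo_dev *dev, struct demo_bio *bio)
{
    (void)dev;
    if (++demo_depth > demo_max_depth)
        demo_max_depth = demo_depth;
    demo_last_sector = bio->sector;
    demo_depth--;
}
```

However many devices are stacked, the make_request handlers never nest more than one level deep in this model, which is the whole point of the list: stack usage stays bounded.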
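Ordinary block devices usually register the __make_request function as the queue's make_request_fn. A device driver can also register its own I/O submission function, which bypasses the default Linux I/O stack and the standard I/O request queue and leaves everything to the driver. Many drivers for nvram devices, SSD cards, and the like do this for performance.
Now look at what __make_request does: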
static int __make_request(struct request_queue *q, struct bio *bio)
{
    struct request *req;
    int el_ret;
    unsigned int bytes = bio->bi_size;
    const unsigned short prio = bio_prio(bio);
    const bool sync = bio_rw_flagged(bio, BIO_RW_SYNCIO);
    const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
    const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
    int rw_flags;
    if (bio_rw_flagged(bio, BIO_RW_BARRIER) &&
        (q->next_ordered == QUEUE_ORDERED_NONE)) {
        bio_endio(bio, -EOPNOTSUPP);
        return 0;
    }
    /*
     * low level driver can indicate that it wants pages above a
     * certain limit bounced to low memory (ie for highmem, or even
     * ISA dma in theory)
     */
    blk_queue_bounce(q, &bio);
    spin_lock_irq(q->queue_lock);
    if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q))
        goto get_rq;
    el_ret = elv_merge(q, &req, bio);
    switch (el_ret) {
    case ELEVATOR_BACK_MERGE:
        BUG_ON(!rq_mergeable(req));
        if (!ll_back_merge_fn(q, req, bio))
            break;
        trace_block_bio_backmerge(q, bio);
        if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
            blk_rq_set_mixed_merge(req);
        req->biotail->bi_next = bio;
        req->biotail = bio;
        req->__data_len += bytes;
        req->ioprio = ioprio_best(req->ioprio, prio);
        if (!blk_rq_cpu_valid(req))
            req->cpu = bio->bi_comp_cpu;
        drive_stat_acct(req, 0);
        if (!attempt_back_merge(q, req))
            elv_merged_request(q, req, el_ret);
        goto out;
    case ELEVATOR_FRONT_MERGE:
        BUG_ON(!rq_mergeable(req));
        if (!ll_front_merge_fn(q, req, bio))
            break;
        trace_block_bio_frontmerge(q, bio);
        if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) {
            blk_rq_set_mixed_merge(req);
            req->cmd_flags &= ~REQ_FAILFAST_MASK;
            req->cmd_flags |= ff;
        }
        bio->bi_next = req->bio;
        req->bio = bio;
        /*
         * may not be valid. if the low level driver said
         * it didn't need a bounce buffer then it better
         * not touch req->buffer either...
         */
        req->buffer = bio_data(bio);
        req->__sector = bio->bi_sector;
        req->__data_len += bytes;
        req->ioprio = ioprio_best(req->ioprio, prio);
        if (!blk_rq_cpu_valid(req))
            req->cpu = bio->bi_comp_cpu;
        drive_stat_acct(req, 0);
        if (!attempt_front_merge(q, req))
            elv_merged_request(q, req, el_ret);
        goto out;
    /* ELV_NO_MERGE: elevator says don't/can't merge. */
    default:
        ;
    }
get_rq:
    /*
     * This sync check and mask will be re-done in init_request_from_bio(),
     * but we need to set it earlier to expose the sync flag to the
     * rq allocator and io schedulers.
     */
    rw_flags = bio_data_dir(bio);
    if (sync)
        rw_flags |= REQ_RW_SYNC;
    /*
     * Grab a free request. This is might sleep but can not fail.
     * Returns with the queue unlocked.
     */
    req = get_request_wait(q, rw_flags, bio);
    /*
     * After dropping the lock and possibly sleeping here, our request
     * may now be mergeable after it had proven unmergeable (above).
     * We don't worry about that case for efficiency. It won't happen
     * often, and the elevators are able to handle it.
     */
    init_request_from_bio(req, bio);
    spin_lock_irq(q->queue_lock);
    if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
        bio_flagged(bio, BIO_CPU_AFFINE))
        req->cpu = blk_cpu_to_group(smp_processor_id());
    if (queue_should_plug(q) && elv_queue_empty(q))
        blk_plug_device(q);
    add_request(q, req);
out:
    if (unplug || !queue_should_plug(q))
        __generic_unplug_device(q);
    spin_unlock_irq(q->queue_lock);
    return 0;
}
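The I/O scheduler's merging happens in __make_request, through calls into the active scheduler. __make_request calls elv_merge to ask the scheduler whether the bio can be merged; if it can, the front-merge or back-merge case is handled accordingly. If it cannot be merged, get_request_wait and init_request_from_bio create and initialize a new request from the bio, and add_request then adds that request to the request queue.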
static inline void add_request(struct request_queue *q, struct request *req)
{
    drive_stat_acct(req, 1);
    /*
     * elevator indicated where it wants this request to be
     * inserted at elevator_merge time
     */
    __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
}
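add_request again goes through the scheduler to insert the request at a suitable position in the request queue.
Finally __make_request returns 0 to signal that forwarding of the bio is finished. What happens to the request afterwards depends on the device driver; normally it is handled by the function registered in the request_queue's request_fn pointer. The common SCSI devices, for example, register scsi_request_fn as request_fn. Once the driver has sent the request and finished its own queue handling, it calls blk_complete_request to end the request, and ending the request in turn invokes the bio's callback to complete the bio.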
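The last step, where finishing a request ends up calling each bio's completion callback, can be sketched in userspace like this. It is a deliberately naive model: the demo_* types and functions are hypothetical, and it completes every bio in the chain in one go, whereas the kernel also has to deal with partial completion and byte accounting.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bio {
    struct demo_bio *bi_next;                           /* next bio in the request */
    void (*bi_end_io)(struct demo_bio *bio, int error); /* completion callback */
    void *bi_private;                                   /* owner's context */
};

struct demo_request {
    struct demo_bio *bio;     /* head of the bio chain */
    struct demo_bio *biotail; /* last bio in the chain */
};

/* Complete one bio by invoking its callback, if it has one. */
static void demo_bio_endio(struct demo_bio *bio, int error)
{
    if (bio->bi_end_io)
        bio->bi_end_io(bio, error);
}

/* Complete every bio of the request; returns how many were completed. */
static int demo_end_request(struct demo_request *rq, int error)
{
    int n = 0;

    assert(rq != NULL);
    while (rq->bio) {
        struct demo_bio *bio = rq->bio;

        rq->bio = bio->bi_next; /* unlink before calling back */
        bio->bi_next = NULL;
        demo_bio_endio(bio, error);
        n++;
    }
    rq->biotail = NULL;
    return n;
}

/* Example owner context and callback: count completions, keep the error. */
struct demo_result {
    int completed;
    int last_error;
};

static void demo_count_end_io(struct demo_bio *bio, int error)
{
    struct demo_result *res = bio->bi_private;

    res->completed++;
    res->last_error = error;
}
```

The submitter never sees the request at all; it only learns about completion through the bi_end_io callback it attached to its bio, with bi_private carrying whatever context it needs.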
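This article is only a basic introduction to the Linux block layer. Much is left out, such as buffer_head handling and synchronous versus asynchronous I/O, which I will look into separately later. If you find any problems in the article, corrections are welcome ^_^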