#%include "default.mgp"
%tab 1 size 4, vgap 40, prefix " ", icon box "green" 50
%tab 2 size 4, vgap 40, prefix " ", icon arc "yellow" 50
%tab 3 size 4, vgap 40, prefix " ", icon delta3 "white" 40
%%%%%%%%%%%%%
%page
%tfont "/usr/share/fonts/default/TrueType/timr____.ttf"
%fore "white"
%center
Asynchronous IO for Linux
%fore "blue"
%size 6
Benjamin LaHaise
%size 4
bcrl@redhat.com
{bcrl,blah,kernel}@kvack.org
%left
%size 6
%%%%%%%%%%%%
%page
%fore "white"
Overview
%fore "skyblue"
	why do we need AIO?
	implementation
	userland API
	kernel API
	kernel extensions
	current status / todo
#	file operations
#	wait queue funcs
#	work to dos
#	page wait
#	page locking
#	buffer wait
#	buffer locking
#	semaphores
#	raw devices
#	generic page cache aio
#	networking
#	pipes
#	overhead
#
%%%%%%%%%%%%
%page
%fore "white"
Why AIO?
%fore "skyblue"
	useful in event based applications
	required to make efficient use of raw io
	semantics fit with "zero copy"
	significantly lower overhead for daemons vs select/poll
	scalable for large numbers of connections
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
AIO Design Overview - User APIs
%fore "skyblue"
	event driven: presents completed events to the user
	new device: /dev/aio
		context for io events
		handle for mmap
	new syscalls
		int __io_wait(int ctx_fd, int event_idx, void *key, struct timespec *timeout);
		int __io_cancel(int ctx_fd, int event_idx, void *key);
		int __io_getevents(int ctx_fd, struct io_event *events, int nr, struct timespec *timeout);
		int __submit_ios(int ctx_fd, int nr, struct iocb **iocbs);
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
AIO Design Overview - Kernel APIs
%fore "skyblue"
	make use of existing asynchronous in-kernel APIs where possible, fixing bugs as needed
		readpage/writepage
		submit_bh
	in-kernel API currently uses
		int rw_kiovec(struct file *filp, int rw, struct kiovec *iovec, size_t size, loff_t pos);
	probable alternative
		int io_submit(struct file *filp, struct iocb *iocb);
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
wait_queue_t extension
%fore "skyblue"
	typedef void (*wait_queue_func_t)(wait_queue_t *wait);
	simple extension to wait queues: adds a wait_queue_func_t func; to wait_queue_t
	void init_waitqueue_func_entry(wait_queue_t *q, wait_queue_func_t func);
	provides a threadless means of receiving events in kernel
	lower overhead == better scalability
	works with existing wake all and exclusive semantics
	must be very careful about acquiring locks
%fore "red"
	Note! Use of waitqueue_active introduces races!
%fore "white"
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Work To Dos
%fore "skyblue"
	based on ideas from Jeff Merkey
	context for state machine execution
	allows sharing of a pool of worker threads
	fully non-blocking primitives
	based on task queues + wait queue
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Work To Dos
%fore "skyblue"
	primitives for state driven code
		void wtd_set_action(struct worktodo *wtd, void (*f)(void *d), void *d);
		void wtd_queue(struct worktodo *wtd);
		void wtd_wait_page(struct worktodo *wtd, struct page *page);
		void wtd_lock_page(struct worktodo *wtd, struct page *page);
		void wtd_wait_on_buffer(struct worktodo *wtd, struct page *page);
		void wtd_lock_buffer(struct worktodo *wtd, struct page *page);
		void wtd_down(struct worktodo *wtd, struct page *page);
%%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Current Status
%fore "skyblue"
	current patch is ~3000 lines
	read/write implemented for:
		raw io
		generic page cache (write back and partial O_DSYNC)
	uses wtd operations
	tested and works well - nicely performant on RAID0 :-)
%%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
ToDo
%fore "skyblue"
	write documentation
	improve glibc code
	accounting
	limit kernel memory pinning
	outstanding io limits
	dirty buffer pressure
	socket send/receive
	integrate davem's single copy pipe code
	O_DIRECT support
%%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Current Status
%fore "skyblue"
	Mailing list: linux-aio@kvack.org
		echo subscribe | mail -s foo majordomo@kvack.org
	Download:
		http://people.redhat.com/bcrl/
		http://www.kvack.org/~blah/
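%%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Sketch: wait queue callbacks
%fore "skyblue"
	a userland toy model of the wait_queue_t extension slide, not kernel code
	wait_queue_t, wait_queue_func_t and init_waitqueue_func_entry come from the slides
	everything else is an assumption made for illustration: the struct layout, the exclusive and data fields, add_wait_queue, wake_up and the demo helpers
	the point it tries to show: a wake-up runs a callback directly instead of waking a sleeping thread

```c
/* Userland toy model of callback-driven wait queues; NOT kernel code. */
#include <assert.h>
#include <stddef.h>

typedef struct wait_queue wait_queue_t;
typedef void (*wait_queue_func_t)(wait_queue_t *wait);

struct wait_queue {
    wait_queue_func_t func;  /* the field the slide adds to wait_queue_t */
    int exclusive;           /* assumption: marks an exclusive waiter */
    void *data;              /* hypothetical per-waiter state for the demo */
    wait_queue_t *next;      /* assumption: simple singly linked queue */
};

/* Hypothetical queue head; a real kernel head also carries a lock. */
typedef struct {
    wait_queue_t *head;
    wait_queue_t *tail;
} wait_queue_head_t;

void init_waitqueue_func_entry(wait_queue_t *q, wait_queue_func_t func)
{
    assert(func != NULL);
    q->func = func;
    q->exclusive = 0;
    q->data = NULL;
    q->next = NULL;
}

/* Append a waiter to the tail of the queue. */
void add_wait_queue(wait_queue_head_t *h, wait_queue_t *q)
{
    q->next = NULL;
    if (h->tail)
        h->tail->next = q;
    else
        h->head = q;
    h->tail = q;
}

/*
 * Run each waiter's callback in queue order and stop after the first
 * exclusive waiter. Waiters stay queued. Returns the number of
 * callbacks that ran.
 */
int wake_up(wait_queue_head_t *h)
{
    int ran = 0;
    wait_queue_t *q = h->head;

    while (q) {
        /* Read these first: the callback is free to modify *q. */
        wait_queue_t *next = q->next;
        int exclusive = q->exclusive;

        q->func(q);
        ran++;
        if (exclusive)
            break;
        q = next;
    }
    return ran;
}

/* Demo callback: counts events in the int that data points to. */
static void count_event(wait_queue_t *wait)
{
    (*(int *)wait->data)++;
}

/* Queue three waiters, the second exclusive; return the events seen. */
int demo_wake(void)
{
    wait_queue_head_t head = { NULL, NULL };
    wait_queue_t a, b, c;
    int hits = 0;

    init_waitqueue_func_entry(&a, count_event);
    init_waitqueue_func_entry(&b, count_event);
    init_waitqueue_func_entry(&c, count_event);
    a.data = b.data = c.data = &hits;
    b.exclusive = 1;

    add_wait_queue(&head, &a);
    add_wait_queue(&head, &b);
    add_wait_queue(&head, &c);

    wake_up(&head);  /* runs a and b, then stops at the exclusive waiter */
    return hits;
}
```

	in this model demo_wake() sees two events: the wake-up stops at the exclusive waiter, so the third callback never runs
	no locking is modeled here, so it says nothing about the locking and waitqueue_active caveats on the slide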