From <>  Tue Aug 23 12:34:00 2005
Received: ("??"@kanga.kvack.org) by kvack.org id <S26679AbVHWQdk>;
	Tue, 23 Aug 2005 12:33:40 -0400
To:	owner-linux-aio@kvack.org
From:	The Post Office <postmaster>
Sender:	mailer-daemon
Subject: Delivery reports about your email [FAILED(1)]
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status;
	boundary="S26679AbVHWQdk=_/kanga.kvack.org"
Message-Id: <20050823163350Z26679-957+408@kvack.org>
Date:	Tue, 23 Aug 2005 12:33:40 -0400
Return-Path: <>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=0.93.5
Status: RO
Content-Length: 14623
Lines: 394

This is MULTIPART/REPORT structured message as defined at RFC 1894.

Ask your email client software vendor, when will they support this
report format by showing its formal part in your preferred language.

--S26679AbVHWQdk=_/kanga.kvack.org
Content-Type: text/plain

This is a collection of reports about email delivery
process concerning a message you originated.

Some explanations/translations for these reports
can be found at:
      http://www.zmailer.org/delivery-report-decoding.html

FAILED:
  Original Recipient:
    rfc822;bcrl@linux.intel.com
  Control data:
    smtp linux.intel.com bcrl@linux.intel.com 99
  Diagnostic texts:
...\
    <<- MAIL From:<owner-linux-aio@kvack.org> BODY=8BITMIME SIZE=10518
    ->> 250 2.1.0 <owner-linux-aio@kvack.org>... Sender ok
    <<- RCPT To:<bcrl@linux.intel.com> NOTIFY=FAILURE ORCPT=rfc822;bcrl@linux.intel.com
    ->> 250 2.1.5 <bcrl@linux.intel.com>... Recipient ok
    <<- DATA
    ->> 354 Enter mail, end with "." on a line by itself
    <<- .
    ->> 554 5.7.1 Message seems to be spam, rejected. Your message was not received by Intel. For more information, see http://www.intel.com/intel/sss/en/whitelist.htm .  Japanese page: Intel SPAM error click here: http://www.intel.com/intel/sss/jp/whitelist.htm

Following is a copy of MESSAGE/DELIVERY-STATUS format section below.
It is copied here in case your email client is unable to show it to you.
The information here below is in  Internet Standard  format designed to
assist automatic, and accurate presentation and usage of said information.
In case you need human assistance from the Postmaster(s) of the system which
sent you this report, please include this information in your question!

    Virtually Yours,
        Automatic Email Delivery Software

Reporting-MTA: dns; kanga.kvack.org
Arrival-Date: Tue, 23 Aug 2005 12:18:06 -0400
Local-Spool-ID: S26585AbVHWQSG

Original-Recipient: rfc822;bcrl@linux.intel.com
Final-Recipient: RFC822;bcrl@linux.intel.com
Action: failed
Status: 5.7.1
Diagnostic-Code: smtp; 554 (Message seems to be spam, rejected. Your message was not received by Intel. For more information, see http://www.intel.com/intel/sss/en/whitelist.htm .  Japanese page: Intel SPAM error click here: http://www.intel.com/intel/sss/jp/whitelist.htm)
Remote-MTA: dns; jf-in.intel.com (134.134.136.19|25|66.96.29.28|52519)
Last-Attempt-Date: Tue, 23 Aug 2005 12:18:42 -0400
X-ZTAID: smtp[26313]


Following is copy of the message headers. Original message content may
be in subsequent parts of this MESSAGE/DELIVERY-STATUS structure.

Received: (root@kanga.kvack.org) by kvack.org id <S26585AbVHWQSG>;
	Tue, 23 Aug 2005 12:18:06 -0400
Received: from Tradebot-Systems-1096753.cust-rtr.swbell.net ([68.90.170.2]:39360
	"EHLO tbmail.tradebot.com") by kvack.org with ESMTP
	id <S26679AbVHWQR7> convert rfc822-to-8bit; Tue, 23 Aug 2005 12:17:59 -0400
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 8BIT
Subject: RE: aio newbie question
X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0
Date:	Tue, 23 Aug 2005 11:17:53 -0500
Message-ID: <07774C6E31D94A44A2A60E2085944F0947C121@tbmail.tradebot.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: aio newbie question
Thread-Index: AcWn4mYKx1G2Z/jMQAmapkQ+EwRtxgAFwb8w
From:	"Paul Rose" <prose@tradebotsystems.com>
To:	<linux-aio@kvack.org>
Cc:	<suparna@in.ibm.com>, <mason@suse.com>
Original-Recipient: rfc822;linux-aio@kvack.org
Sender:	owner-linux-aio@kvack.org
Precedence: bulk
X-Loop:	owner-majordomo@kvack.org


--S26679AbVHWQdk=_/kanga.kvack.org
Content-Type: message/delivery-status

Reporting-MTA: dns; kanga.kvack.org
Arrival-Date: Tue, 23 Aug 2005 12:18:06 -0400
Local-Spool-ID: S26585AbVHWQSG

Original-Recipient: rfc822;bcrl@linux.intel.com
Final-Recipient: RFC822;bcrl@linux.intel.com
Action: failed
Status: 5.7.1
Diagnostic-Code: smtp; 554 (Message seems to be spam, rejected. Your message was not received by Intel. For more information, see http://www.intel.com/intel/sss/en/whitelist.htm .  Japanese page: Intel SPAM error click here: http://www.intel.com/intel/sss/jp/whitelist.htm)
Remote-MTA: dns; jf-in.intel.com (134.134.136.19|25|66.96.29.28|52519)
Last-Attempt-Date: Tue, 23 Aug 2005 12:18:42 -0400
X-ZTAID: smtp[26313]

--S26679AbVHWQdk=_/kanga.kvack.org
Content-Type: message/rfc822

Received: (root@kanga.kvack.org) by kvack.org id <S26585AbVHWQSG>;
	Tue, 23 Aug 2005 12:18:06 -0400
Received: from Tradebot-Systems-1096753.cust-rtr.swbell.net ([68.90.170.2]:39360
	"EHLO tbmail.tradebot.com") by kvack.org with ESMTP
	id <S26679AbVHWQR7> convert rfc822-to-8bit; Tue, 23 Aug 2005 12:17:59 -0400
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 8BIT
Subject: RE: aio newbie question
X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0
Date:	Tue, 23 Aug 2005 11:17:53 -0500
Message-ID: <07774C6E31D94A44A2A60E2085944F0947C121@tbmail.tradebot.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: aio newbie question
Thread-Index: AcWn4mYKx1G2Z/jMQAmapkQ+EwRtxgAFwb8w
From:	"Paul Rose" <prose@tradebotsystems.com>
To:	<linux-aio@kvack.org>
Cc:	<suparna@in.ibm.com>, <mason@suse.com>
Original-Recipient: rfc822;linux-aio@kvack.org
Sender:	owner-linux-aio@kvack.org
Precedence: bulk
X-Loop:	owner-majordomo@kvack.org

Suparna,

I've had a look in my /usr/src/linux (which is linux-2.6.11.4-21.7 on
SUSE 9.3)

It does not seem like it has the changes that exist in your
http://www.kernel.org/pub/linux/kernel/people/suparna/aio/2610-rc2/aiowr
.patch.

(i.e. in filemap.c wait_on_page_writeback_range does not delegate to
__wait_on_page_writeback_range with an added wait_queue).

I'm not really confident I've identified this correctly as it is my
first peek into kerel code.

I don't really know why aio-stress timings looked like there might be
some async behaviour for O_SYNC (submit average < complete average).

My test program shows night and day difference with regard to async
behaviour between O_DIRECT vs O_SYNC (the ony difference being the mode
flag and using posix_memalign to get the buffer for O_DIRECT).  The
io_submits take all the time (even with 40K writes) and io_getevents
chews through 100% of the "pending" writes without having to wait.

When SUSE makes your O_SYNC pathces available for 9.3 I'll revisit this
again, for now I'll use O_DIRECT.  (I don't think I can get approval to
run our system on a self-patched kernel).

Thanks for all your help,
Paul Rose

PS:

I've included the source to my test program below.  
It is possible my analysis above is wrong and my test code is just doing
something stupid, but the test code is pretty simple.

Please don't feel obligated to review it, but if you feel so inclined:

Original error return vs. errno problem fixed. 
Converted to straight "C", 108 lines.

build: gcc -laio -o simple simple.c
run:  ./simple /tmp/foo.dat 1250 40960 Q O_DIRECT 
  (for 1250 records of size 40K filled with letter Q with mode O_DIRECT)
run:  ./simple /tmp/foo.dat 1250 40960 Q O_SYNC 
  (for 1250 records of size 40K filled with letter R with mode O_SYNC)

test output:
 ~/testaio$ ./simple /tmp/foo.dat 1250 40960 Q O_SYNC
 writing 1250 records of size 40960 to /tmp/foo.dat (O_SYNC)
 1250 written, 1250 completed, 0 pending
 47.568 prepTime, 1164.491 submitTime, 1.791 waitTime, 1213.850
totalTime

 ~/testaio$ ./simple /tmp/foo.dat 1250 40960 Q O_DIRECT
 writing 1250 records of size 40960 to /tmp/foo.dat (O_DIRECT)
 1250 written, 1250 completed, 0 pending
 93.419 prepTime, 134.677 submitTime, 992.670 waitTime, 1220.766
totalTime

run it once to pre-allocate the file (ignoring results).
later runs will overwrite the same file (i.e. there is no O_TRUNC).
output times show sum-total time (msec.usec) for all io_prep_pwrite,
io_submits and all io_getevents.
higher prep times for O_DIRECT are (I think) because of posix_memalign
vs malloc.

uncommenting the in-loop printf shows that pending is always zero for
O_SYNC but stays above zero for O_DIRECT.

///////simple.c
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <libaio.h>
#ifndef O_DIRECT
#define O_DIRECT         040000
#endif

double usec(){
    struct timeval tv;
    gettimeofday(&tv,0);
    return (double)tv.tv_sec*1000000.0 + (double)tv.tv_usec;
}
int die(const char *where, int error){
    fprintf(stderr, "Error %s: in %s\n", strerror(error), where);
    exit(1);
}

enum { max_pending = 128 }; //try to keep this many writes pending

int main(int argc, char *const *argv){
    const char *file= argc > 1 ? argv[1]       : "/tmp/foo.dat";
    int  numRecords = argc > 2 ? atoi(argv[2]) : 10000;
    int  recordSize = argc > 3 ? atoi(argv[3]) : 200;
    char fillChar   = argc > 4 ? argv[4][0]    : 'Z';
    int  mode       = O_SYNC;
    if(argc > 5){
        if(strcmp(argv[5], "O_DIRECT") == 0) mode = O_DIRECT;
        if(strcmp(argv[5], "O_SYNC") == 0) mode = O_SYNC;
    }
    if(argc > 6 || !fillChar || !file || numRecords <= 0 || recordSize
<= 0){
        fprintf(stderr,"usage:  testaio file numRecords recordSize
fillChar mode\n");
        exit(1);
    }
    if((mode&O_DIRECT) && (recordSize % 512) != 0){
        printf("recordSize must be multiple of 512 for direct\n");
        exit(1);
    }
    int fd = open(file,O_CREAT|O_WRONLY|mode,0666);
    if(fd < 0) die("open", errno);

    printf("writing %d records of size %d to %s (%s)\n", numRecords,
recordSize, file, (mode&O_DIRECT)?"O_DIRECT":"O_SYNC");

    io_context_t myctx;
    memset(&myctx, 0, sizeof(myctx));
    int ret;
    if((ret=io_queue_init(max_pending, &myctx))!=0) die("io_queue_init",
-ret);

    double prepTime=0, submitTime = 0, completeTime = 0, totalTime = 0;
    int pending=0, completed=0, written=0;
    while(completed < numRecords){
        struct iocb *iocb_list[max_pending];
        double s0 = usec();
        int i;
        for(i=0; written<numRecords && pending<max_pending; ++i){
            iocb_list[i] = malloc(sizeof(struct iocb));

            char * buffer = 0;
            if(mode&O_DIRECT){
                ret = posix_memalign((void**)&buffer, 512, recordSize);
                if(ret != 0) die("posix_memalign", ret);
            }
            else{
                buffer = (char*)malloc(recordSize);
            }
            if(!buffer) die("alloc", ENOMEM);
            memset(buffer, fillChar, recordSize-1); /* fill buffer with
fillchar + newline */
            buffer[recordSize-1] = '\n';
             io_prep_pwrite(iocb_list[i], fd, buffer, recordSize,
written*recordSize);
            iocb_list[i]->data = (void*)written;
            ++written;
            ++pending;
        }
        double s1 = usec(); prepTime += s1-s0;
        double s2 = s1;
        if(i > 0){
            if((ret=io_submit(myctx, i, iocb_list)) < 0)
die("io_submit", ret);
            s2 = usec(); submitTime += s2-s1;
        }

        struct io_event event;
        struct timespec to = {0,0};

        while((ret = io_getevents(myctx, 1, 1, &event,
pending==max_pending?0:&to))==1){
            if(event.res2 != 0) die("event.res2 != 0", 0);
            if(event.res != event.obj->u.c.nbytes) die("event.res",
-event.res);

            ++completed;
            --pending;

            free(event.obj->u.c.buf);
            free(event.obj);
        }
        /*printf("%d written, %d completed, %d pending\n", written,
completed, pending);*/
        if(ret < 0) die("io_getevents", -ret);
        double s3 = usec(); completeTime += s3 - s2;
        totalTime += s3-s0;
    }
    printf("%d written, %d completed, %d pending\n", written, completed,
pending);
    printf("%.3f prepTime, %.3f submitTime, %.3f waitTime, %.3f
totalTime\n",
        prepTime/1000.0, submitTime/1000.0, completeTime/1000.0,
totalTime/1000.0);
    close(fd);
    return 0;
}
-----Original Message-----
From: Suparna Bhattacharya [mailto:suparna@in.ibm.com] 
Sent: Tuesday, August 23, 2005 8:08 AM
To: Paul Rose
Cc: linux-aio@kvack.org; mason@suse.com
Subject: Re: aio newbie question


On Tue, Aug 23, 2005 at 07:31:37AM -0500, Paul Rose wrote:
> Suparna,
> 
> Here are the results of running aio-stress:
> 
>  prose03:~/testaio$ ./aio-stress -S -I 10240 -l -L -r 1 -t 1 -o 0
> /tmp/foo.dat
>  adding stage write
>  starting with write
>  file size 1024MB, record size 1KB, depth 64, ios per iteration 8
>  max io_submit 8, buffer alignment set to 4KB
>  threads 1 files 1 contexts 1 context offset 2MB verification off
>  Running single thread version
>  latency min 0.86 avg 6.81 max 61.29
>           10240 < 100 0 < 250 0 < 500 0 < 1000 0 < 5000 0 < 10000
>  completion latency min 0.01 avg 23.26 max 100.80
>           81848 < 100 8 < 250 0 < 500 0 < 1000 0 < 5000 0 < 10000
>  global num pending is 64
>  prose03:~/testaio$
> 
> I can re-run it with different parameters if needed.  
> I ran:
>   -r 1 (since my program will be doing small IO's)
>   -t 1 (since my program will be doing disk IO from one thread)
>   -o 0 (since my program will do non-aio reads at startup to build
> state, but will normally just write)
>   
> I'm not sure I'm reading the output right, but it looks like the
> submission latency average is much less than the completion latency
> average.  In other words, it is working.  I'll have to examine what
> aio-stress is doing that my sample program is not in order to
understand
> why.
> 
> I spotted some small problems with my test program that I fixed,
without
> changing the outcome (I wasn't handling return values vs. errno
> properly, and the last io_getevents() that tried to wait without
getting
> an event was wrong-headed).
> 
> The main difference I can think of is that my programs did smaller
IO's
> whose length, offset and buffer alignment did not align on 512 byte
> boundaries.  But with non O_DIRECT this should not be a factor, right?

For smaller IOs (i.e. less than the filesystem block size), the OS has
to read in the page from disk before modifying it and issuing the IO -
perhaps that could have an impact. What is your filesystem block size ?

BTW, do you really need to do such small IOs, or was this just a test
case of
sorts ? 

What happens when you try your test program with much larger IOs ?

Regards
Suparna
--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>--S26679AbVHWQdk=_/kanga.kvack.org--

