Re: [Caml-list] Severe loss of performance due to new signal handling (fwd)

Mailing list for all users of the OCaml language and system.

* Re: [Caml-list] Severe loss of performance due to new signal handling (fwd)
@ 2006-03-21 23:53 Brian Hurt
  2006-03-22  9:20 ` Alexander S. Usov
  2006-03-22 10:56 ` Robert Roessler
  0 siblings, 2 replies; 3+ messages in thread
From: Brian Hurt @ 2006-03-21 23:53 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: TEXT/PLAIN, Size: 968 bytes --]


Oops, didn't intend this message to be off-list.

---------- Forwarded message ----------
Date: Tue, 21 Mar 2006 16:32:51 -0600 (CST)
From: Brian Hurt <bhurt@spnz.org>
To: Robert Roessler <roessler@rftp.com>
Subject: Re: [Caml-list] Severe loss of performance due to new signal handling



On Tue, 21 Mar 2006, Robert Roessler wrote:

> Well, I *thought* there was a marked absence of "bit-level parallelism" in 
> the signal-handling... ;)
> 
> So the "expense" of individual atomic operations is not really what is at the 
> heart of this performance problem...

Hmm.  Maybe not.  I'm measuring a 4 clock cycle cost for a xchgl, both with and 
without a lock on my Athlon XP 1.8GHz.  See attached code. Naturally, this is a 
uniprocessor machine and the memory location is in L1 cache (or will be soon), 
and no contention, so this is definitely best case.  4 clocks is about right 
for a read and a write to L1 cache (each L1 cache access taking 2 clocks).

Brian

[-- Attachment #2: Type: TEXT/X-CSRC, Size: 1369 bytes --]

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

#if !defined(__GNUC__) || !defined(__i386__)
#error This code only works with GCC/i386.
#endif

/* The reason this only works under GNU C and the x86 is we're using the
 * rdtsc instruction.
 */
static inline unsigned long long rdtsc(void) {
	unsigned long long rval;

	asm volatile ("rdtsc" : "=A" (rval));
	return rval;
}

static inline unsigned long read_and_clear(unsigned long * ptr) {
	unsigned long rval;

	asm volatile ("lock xchgl %0, %1" : "=r" (rval), "+m" (*ptr) : "0" (0));
	return rval;
}


int main(void) {
	int i;
	volatile unsigned long trash;
	unsigned long val = 1;
	unsigned long long start, stop, time, min;

	/* Time how long a rdtsc takes; we do this ten times and take the
	 * cheapest run.
	 */
	min = ~0ull;
	for (i = 0; i < 10; ++i) {
		start = rdtsc();
		trash = 0;
		stop = rdtsc();
		time = stop - start;
		if (time < min) {
			min = time;
		}
	}
	printf("Minimum time for a rdtsc instruction (in clocks): %llu\n", min);

	min = ~0ull;
	for (i = 0; i < 10; ++i) {
		val = 1;
		start = rdtsc();
		trash = read_and_clear(&val);
		stop = rdtsc();
		time = stop - start;
		if (time < min) {
			min = time;
		}
	}
	printf("Minimum time for a read_and_clear() + rdtsc (in clocks): %llu\n", min);
	return 0;
}
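
The attachment above builds only with GCC on 32-bit x86, because it relies on the "=A" constraint and on inline assembly. Note also that on x86 an xchg with a memory operand is locked implicitly, whether or not the lock prefix is written, which would be consistent with measuring the same cost with and without it. What follows is a minimal sketch of the same measurement for other targets, assuming C11 <stdatomic.h> and the GCC/Clang __rdtsc() intrinsic. The names ticks, read_and_clear_c11 and min_read_and_clear_ticks are illustrative and are not part of the original attachment.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Cycle counter via the compiler intrinsic (GCC/Clang on x86). */
static inline uint64_t ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Assumption: off x86 we fall back to nanoseconds, not clock cycles. */
static inline uint64_t ticks(void) {
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Atomically fetch the old value and store 0 in its place.
 * On x86 compilers typically emit an xchg instruction for this.
 */
static inline unsigned long read_and_clear_c11(atomic_ulong *ptr) {
	return atomic_exchange(ptr, 0ul);
}

/* Minimum ticks over `runs` timed calls. The result still includes
 * the cost of reading the timer itself, as in the attachment.
 */
static uint64_t min_read_and_clear_ticks(int runs) {
	atomic_ulong val;
	volatile unsigned long trash;
	uint64_t min = UINT64_MAX;

	for (int i = 0; i < runs; ++i) {
		atomic_store(&val, 1ul);
		uint64_t start = ticks();
		trash = read_and_clear_c11(&val);
		uint64_t stop = ticks();
		if (stop - start < min)
			min = stop - start;
	}
	(void)trash;
	return min;
}
```

Calling min_read_and_clear_ticks(10) mirrors the second loop of the attachment. The figures it produces will not match the ones reported in this thread, since the compiler, the instruction sequence and the hardware all differ.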



* Re: [Caml-list] Severe loss of performance due to new signal handling (fwd)
  2006-03-21 23:53 [Caml-list] Severe loss of performance due to new signal handling (fwd) Brian Hurt
@ 2006-03-22  9:20 ` Alexander S. Usov
  2006-03-22 10:56 ` Robert Roessler
  1 sibling, 0 replies; 3+ messages in thread
From: Alexander S. Usov @ 2006-03-22  9:20 UTC (permalink / raw)
  To: caml-list

On Wednesday 22 March 2006 00:53, Brian Hurt wrote:
> Oops, didn't intend this message to be off-list.
>
> ---------- Forwarded message ----------
> Date: Tue, 21 Mar 2006 16:32:51 -0600 (CST)
> From: Brian Hurt <bhurt@spnz.org>
> To: Robert Roessler <roessler@rftp.com>
> Subject: Re: [Caml-list] Severe loss of performance due to new signal
> handling
>
> On Tue, 21 Mar 2006, Robert Roessler wrote:
> > Well, I *thought* there was a marked absence of "bit-level parallelism"
> > in the signal-handling... ;)
> >
> > So the "expense" of individual atomic operations is not really what is at
> > the heart of this performance problem...
>
> Hmm.  Maybe not.  I'm measuring a 4 clock cycle cost for a xchgl, both with
> and without a lock on my Athlon XP 1.8GHz.  See attached code. Naturally,
> this is a uniprocessor machine and the memory location is in L1 cache (or
> will be soon), and no contention, so this is definitely best case.  4
> clocks is about right for a read and a write to L1 cache (each L1 cache
> access taking 2 clocks).

$ ./a.out
Minimum time for a rdtsc instruction (in clocks): 88
Minimum time for a read_and_clear() + rdtsc (in clocks): 248

$ grep 'model name' /proc/cpuinfo
model name      : Intel(R) Pentium(R) 4 CPU 2.00GHz
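
The second figure printed by the program includes the cost of the rdtsc pair, so the cost of the exchange alone is the difference between the two lines. A small sketch of that arithmetic follows; net_clocks and clocks_to_ns are hypothetical helper names, and the conversion assumes the CPU runs at its nominal frequency. Because rdtsc is not a serializing instruction, the subtraction is only an estimate.

```c
#include <assert.h>

/* Clocks attributable to the operation itself:
 * the timed total minus the timer overhead.
 */
static unsigned long long net_clocks(unsigned long long total,
                                     unsigned long long overhead) {
	assert(total >= overhead);
	return total - overhead;
}

/* Convert clocks to nanoseconds at a given core frequency in GHz. */
static double clocks_to_ns(unsigned long long clocks, double ghz) {
	return (double)clocks / ghz;
}
```

With the output above, net_clocks(248, 88) is 160 clocks, and clocks_to_ns(160, 2.0) is 80 ns at the nominal 2.00 GHz. The same subtraction applies to any pair of figures the program prints.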


-- 
Best regards,
  Alexander.



* Re: [Caml-list] Severe loss of performance due to new signal handling (fwd)
  2006-03-21 23:53 [Caml-list] Severe loss of performance due to new signal handling (fwd) Brian Hurt
  2006-03-22  9:20 ` Alexander S. Usov
@ 2006-03-22 10:56 ` Robert Roessler
  1 sibling, 0 replies; 3+ messages in thread
From: Robert Roessler @ 2006-03-22 10:56 UTC (permalink / raw)
  To: Caml-list

Brian Hurt wrote:
> 
> ---------- Forwarded message ----------
> Date: Tue, 21 Mar 2006 16:32:51 -0600 (CST)
> From: Brian Hurt <bhurt@spnz.org>
> To: Robert Roessler <roessler@rftp.com>
> Subject: Re: [Caml-list] Severe loss of performance due to new signal 
> handling
> 
> On Tue, 21 Mar 2006, Robert Roessler wrote:
> 
>> Well, I *thought* there was a marked absence of "bit-level 
>> parallelism" in the signal-handling... ;)
>>
>> So the "expense" of individual atomic operations is not really what is 
>> at the heart of this performance problem...
> 
> Hmm.  Maybe not.  I'm measuring a 4 clock cycle cost for a xchgl, both 
> with and without a lock on my Athlon XP 1.8GHz.  See attached code. 
> Naturally, this is a uniprocessor machine and the memory location is in 
> L1 cache (or will be soon), and no contention, so this is definitely 
> best case.  4 clocks is about right for a read and a write to L1 cache 
> (each L1 cache access taking 2 clocks).

And after adjusting the inline assembly syntax for vc7.1, I get

Minimum time for a rdtsc instruction (in clocks): 38
Minimum time for a read_and_clear() + rdtsc (in clocks): 75

This is on a P-III S (Tualatin) @ 1.4GHz on Windows XP SP2.

Robert Roessler
roessler@rftp.com
http://www.rftp.com


