Subj.

In functional languages, several constructors are often applied one after
another. Each constructor with arguments causes a memory allocation. For example:

    let f x y z = [x+y;y+z;z+x]

is compiled by O'Caml into the following code (the real x86 code is at the end
of the message):

	t  := alloc list item ; t.cdr  := NIL; 	t.car  := z+x
	t' := alloc list item ; t'.cdr := t; 	t'.car := y+z
	t" := alloc list item ; t".cdr := t'; 	t".car := x+y
	return t"

On x86 each allocation takes 6 instructions:

.L101:	movl	young_ptr, %eax
	subl	$12, %eax
	movl	%eax, young_ptr
	cmpl	young_limit, %eax
	jb	.L102
	leal	4(%eax), %edx

(plus `caml_call_gc; jmp .L101' at the end of the function, and a frame for the
GC of approx. 8-16B) - about 40B in total.

If we allocate memory for all 3 list items with one request, then we can replace
each of the last two allocations with the following:

       mov young_ptr, %eax
       lea offset(%eax), %reg

8B and nothing more.
	
This optimization is valid only within a basic block, and only if the code
between the allocations can't trigger a garbage collection.
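
The rule above can be sketched roughly as follows. This is only an
illustration on a made-up instruction type, not the compiler's real
representation and not my actual patch; the names `instr', `Alloc_offset'
and `combine_allocs' are invented here. Offsets are counted from the start
of the combined block.

```ocaml
(* Hypothetical, simplified instructions of one basic block. *)
type instr =
  | Alloc of int         (* allocate this many bytes, with a GC check *)
  | Alloc_offset of int  (* address at this offset in the combined block *)
  | Op of string         (* instruction that can never trigger a GC *)
  | Call of string       (* instruction that may trigger a GC *)

(* Merge every run of allocations that is not interrupted by a possible
   GC point: the first Alloc of a run requests the total size, and each
   later one becomes a plain address computation. *)
let combine_allocs (block : instr list) : instr list =
  (* [run] holds the pending instructions in reverse order;
     [acc] holds the output built so far, also in reverse order. *)
  let flush run acc =
    match List.rev run with
    | [] -> acc
    | run ->
      let total =
        List.fold_left
          (fun s i -> match i with Alloc n -> s + n | _ -> s) 0 run in
      let _, _, out =
        List.fold_left
          (fun (off, first, out) i ->
             match i with
             | Alloc n when first -> (off + n, false, Alloc total :: out)
             | Alloc n -> (off + n, false, Alloc_offset off :: out)
             | other -> (off, first, other :: out))
          (0, true, acc) run
      in
      out
  in
  let rec go run acc = function
    | [] -> List.rev (flush run acc)
    | (Call _ as c) :: rest -> go [] (c :: flush run acc) rest
    | i :: rest -> go (i :: run) acc rest
  in
  go [] [] block
```

In this sketch a `Call' ends the current run, so an allocation after it
keeps its own GC check, as the validity condition requires.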

I implemented it. It takes about 90 lines of added/changed code in the compiler
(together with the two changes described below). This optimization
reduces the code size of ocamlopt.opt+ocamlc.opt by 8.7%. I think this is an
excellent result for a 90-line change.

Bootstrapping ocamlopt.opt was successful, so I hope my changes are
correct.

This optimization can be applied to all architectures. For architectures
that keep `young_ptr' in memory (x86, m68k) there is a further
improvement: in many cases, instead of loading `young_ptr' from memory, we
can use the address of the object created by the previous constructor. That
address is `young_ptr + offset', and it is frequently already in a register
because it is an argument of the constructor that follows. In this case we
eliminate the first of the two remaining instructions. This optimization
reduces the ocamlopt.opt+ocamlc.opt code size on x86 by 1.6%.

And the last one: on the x86 and m68k architectures, `selection.ml' contains
the following method:

method select_store addr exp =
  match exp with
    Cconst_int n -> (Ispecific(Istore_int(n, addr)), Ctuple [])
  | Cconst_pointer n -> (Ispecific(Istore_int(n, addr)), Ctuple [])
  | Cconst_symbol s -> (Ispecific(Istore_symbol(s, addr)), Ctuple [])
  | _ -> super#select_store addr exp

The alternative

    Cconst_int n -> (Ispecific(Istore_int(n, addr)), Ctuple [])

handles stores of Cconst_int immediate constants, but ignores
Cconst_natint constants. As a result, the following poor code is generated
immediately after each memory allocation:

	mov	$tag, %r1
	mov	%r1, -4(%r2)

instead of the better:

	mov	$tag, -4(%r2)

I fixed this by adding the following match pattern:

  | Cconst_natint n 
      when Nativeint.cmp n min_int >= 0  
      &&   Nativeint.cmp n max_int <= 0 
    ->
      (Ispecific(Istore_int(Nativeint.to_int n, addr)), Ctuple [])

This change reduces the code size of ocamlc.opt+ocamlopt.opt by a further 0.7%.
The same change is needed for m68k.
A better solution would probably be to add an Istore_natint operation.
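
For illustration, the range check in the guard can be tried on its own.
This is a minimal sketch written against the standard library's Nativeint
module (`compare', `of_int', `to_int'), which is an assumption here: the
patch above uses the compiler's own Nativeint module, whose interface
differs. The names `fits_in_int' and `as_immediate' are invented for the
example.

```ocaml
(* Assumption: standard-library Nativeint, not the compiler-internal
   module used by the patch. *)

(* True when the native integer can be represented as an OCaml int,
   which is one bit narrower than a native word. *)
let fits_in_int (n : nativeint) : bool =
  Nativeint.compare n (Nativeint.of_int min_int) >= 0
  && Nativeint.compare n (Nativeint.of_int max_int) <= 0

(* Some k when the constant can be stored as an immediate int,
   None when the generic store path would have to be used. *)
let as_immediate (n : nativeint) : int option =
  if fits_in_int n then Some (Nativeint.to_int n) else None
```

A small constant such as a block header would normally pass the check,
while the extreme native values do not, so they would fall through to the
default case.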


I estimated the number of memory allocations in ocamlopt.opt+ocamlc.opt.
I found about 12,000 memory allocations, approximately 7,000 of which are
subject to the described optimizations.

Table of code sizes:

             old size:   new size-1:     new size-2:     new size-3:   total: