From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.1 required=5.0 tests=AWL autolearn=disabled version=3.1.3 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id CAE59BBCA for ; Wed, 14 May 2008 06:40:06 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AvYCAIcJKkhDWxLCbmdsb2JhbACBU5BGm2U X-IronPort-AV: E=Sophos;i="4.27,484,1204498800"; d="scan'208";a="26144700" Received: from ip67-91-18-194.z18-91-67.customer.algx.net (HELO server1.bertec.net) ([67.91.18.194]) by mail4-smtp-sop.national.inria.fr with ESMTP; 14 May 2008 06:40:06 +0200 Received: from kuba.bertec.net (kuba.bertec.net [192.168.2.16]) by server1.bertec.net (Postfix) with ESMTP id CA65FCDFB6 for ; Wed, 14 May 2008 00:40:04 -0400 (EDT) From: Kuba Ober To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Re: Why OCaml sucks Date: Wed, 14 May 2008 00:40:03 -0400 User-Agent: KMail/1.9.9 References: <200805090139.54870.jon@ffconsultancy.com> <48299F46.5030906@janestcapital.com> <4829A207.2030601@fischerventure.com> In-Reply-To: <4829A207.2030601@fischerventure.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200805140040.04101.ober.14@osu.edu> X-Spam: no; 0.00; ocaml:01 parsers:01 moderately:01 cheers:01 lexical:01 wrote:01 caml-list:01 writting:01 data:02 expression:02 finite:02 finite:02 megabytes:02 automata:02 automata:02 On Tuesday 13 May 2008, Robert Fischer wrote: > > The problem, as I understand it, is in writting parsers. Your standard > > finite automata based regular expression library or lexical analyzer is > > based, at it's heart, on a table lookup- you have a 2D array, whose size > > is the number of input characters times the number of states. For ASCII > > input, the number of possible input characters is small- 256 at most. > > 256 input characters times hundreds of states isn't that big of a table- > > we're looking at sizes in 10's of K- easily handlable even in the bad > > old days of 64K segments. Even going to UTF-16 ups the number of input > > characters from 256 to 65,536- and now a moderately large state machine > > (hundreds of states) weighs in at tens of megabytes of table space. > > And, of course, if you try to handle the entire 31-bit full unicode > > point space, welcome to really large tables :-). > > > > The solution, I think, is to change the implementation of your finite > > automata to use some data structure smarter than a flat 2D array, but > > that's me. > > Yes. It is certainly possible to write slow code to solve this problem. With "slow code" you could have been meaning two things: 1. Table lookup globally replaced by compares-and-jumps. The latter benefit from branch prediction and speculative execution. So it's not slow anymore. 2. Table "compression" used, where a few compare-and-jumps remove huge "unused" swaths of the table. By "unused" I meant "bomb out with an internal error". I think you're being silly. Stop it. Cheers, Kuba