From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2GEnPdV003189 for ; Fri, 16 Mar 2012 15:49:25 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgwFAKtRY0/VuiYS/2dsb2JhbABChGhWsHmBB4IJAQEEAQwXVgULCQIaAiYCAiE2BhOIBQkHp26SIIEviCSGFIEWBJVliy+EeYJn X-IronPort-AV: E=Sophos;i="4.73,598,1325458800"; d="scan'208";a="149797753" Received: from solaria.dimino.org ([213.186.38.18]) by mail1-smtp-roc.national.inria.fr with ESMTP/TLS/ADH-AES256-SHA; 16 Mar 2012 15:49:24 +0100 Received: from caladan (caladan.dim [10.200.42.14]) by solaria.dimino.org (Postfix) with ESMTP id 46B0080053; Fri, 16 Mar 2012 15:49:22 +0100 (CET) Received: from caladan.esterel-technologies.com (localhost [127.0.0.1]) by caladan (Postfix) with ESMTP id D7593FF828; Fri, 16 Mar 2012 15:49:02 +0100 (CET) Date: Fri, 16 Mar 2012 15:49:01 +0100 From: =?UTF-8?B?SsOpcsOpbWll?= Dimino To: Philippe Veber Cc: caml users Message-ID: <20120316154901.76602deb@caladan.esterel-technologies.com> In-Reply-To: References: X-Mailer: Claws Mail 3.8.0 (GTK+ 2.24.10; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by walapai.inria.fr id q2GEnPdV003189 Subject: Re: [Caml-list] Efficient scanning of large strings from files Le Fri, 16 Mar 2012 14:03:38 +0100, Philippe Veber a écrit : > Say that you'd like to search a regexp on a file with lines so long > that you'd rather not load them entirely at once. If you can bound > the size of a match by k << length of a line, then you know that you > can only keep a small portion of the line in memory to search the > regexp. Typically you'd like to access substrings of size k from left > to right. I guess such a thing should involve buffered inputs and > avoid copying strings as much as possible. My question is as follows: > has anybody written a library to access these substrings gracefully > and with decent performance? Cheers, You can use a non-backtracking regexp library to find offsets of the substrings, then seek in the file to extract them. You can use for example the libre library from Jérôme Vouillon [1]. It only accept strings as input but it would be really easy to make it work on input channels (just replace "s.[pos]" by "input_char ic"). [1] http://sourceforge.net/projects/libre/ https://github.com/avsm/ocaml-re.git Cheers, -- Jérémie