Thread safe Str

Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed

* Thread safe Str
@ 2005-01-09 19:30 Christophe TROESTLER
  2005-01-09 20:57 ` [Caml-list] " Gerd Stolpmann
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Christophe TROESTLER @ 2005-01-09 19:30 UTC (permalink / raw)
  To: O'Caml Mailing List

Hi,

The new Str module not only made it LGPL but also very fast (100% more
than Pcre in some cases).  However it is still not thread safe -- and
it is not possible given its interface.  However, from a quick look,
it seems to me that what is under the hood is almost thread safe.  So
why not to write a new module (say Regexp) which would be thread safe
and on top of which Str would be build?  This would not be too much
work, would it?

Cheers,
ChriS

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-09 19:30 Thread safe Str Christophe TROESTLER
@ 2005-01-09 20:57 ` Gerd Stolpmann
  2005-01-10  9:57 ` Alex Baretta
  2005-01-10 15:49 ` Xavier Leroy
  2 siblings, 0 replies; 19+ messages in thread
From: Gerd Stolpmann @ 2005-01-09 20:57 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: O'Caml Mailing List

On Son, 2005-01-09 at 20:30, Christophe TROESTLER wrote:
> Hi,
> 
> The new Str module not only made it LGPL but also very fast (100% more
> than Pcre in some cases).  However it is still not thread safe -- and
> it is not possible given its interface.  However, from a quick look,
> it seems to me that what is under the hood is almost thread safe.  So
> why not to write a new module (say Regexp) which would be thread safe
> and on top of which Str would be build?  This would not be too much
> work, would it?

The Netstring_str module (part of ocamlnet) does it the other way round:
The thread-safe interface is built on top of a potentially unsafe
implementation. Although the current version of Netstring_str uses Pcre
as its implementation, there is a historic version on top of Str. This
other way of layering can be made working because there is the global
master lock protecting the C parts of the regexp engine.

The interface of Netstring_str:
http://ocamlnet.sourceforge.net/manual/refman/Netstring_str.html

Of course, this is not the optimal solution, I would prefer it the way
you suggest.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-09 19:30 Thread safe Str Christophe TROESTLER
  2005-01-09 20:57 ` [Caml-list] " Gerd Stolpmann
@ 2005-01-10  9:57 ` Alex Baretta
  2005-01-10 15:49 ` Xavier Leroy
  2 siblings, 0 replies; 19+ messages in thread
From: Alex Baretta @ 2005-01-10  9:57 UTC (permalink / raw)
  To: Christophe TROESTLER, Ocaml

Christophe TROESTLER wrote:
> Hi,
> 
> The new Str module not only made it LGPL but also very fast (100% more
> than Pcre in some cases).  However it is still not thread safe -- and
> it is not possible given its interface.  However, from a quick look,
> it seems to me that what is under the hood is almost thread safe.  So
> why not to write a new module (say Regexp) which would be thread safe
> and on top of which Str would be build?  This would not be too much
> work, would it?
> 
> Cheers,
> ChriS

I think I remember a thread where Xavier explained that Str is fully 
thread safe because all state data structures are allocated on a 
per-thread basis.

Alex

-- 
*********************************************************************
http://www.barettadeit.com/
Baretta DE&IT
A division of Baretta SRL

tel. +39 02 370 111 55
fax. +39 02 370 111 54

Our technology:

The Application System/Xcaml (AS/Xcaml)
<http://www.asxcaml.org/>

The FreerP Project
<http://www.freerp.org/>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-09 19:30 Thread safe Str Christophe TROESTLER
  2005-01-09 20:57 ` [Caml-list] " Gerd Stolpmann
  2005-01-10  9:57 ` Alex Baretta
@ 2005-01-10 15:49 ` Xavier Leroy
  2005-01-10 16:39   ` Richard Jones
                     ` (3 more replies)
  2 siblings, 4 replies; 19+ messages in thread
From: Xavier Leroy @ 2005-01-10 15:49 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: O'Caml Mailing List

> The new Str module not only made it LGPL but also very fast (100% more
> than Pcre in some cases).  However it is still not thread safe -- and
> it is not possible given its interface.  However, from a quick look,
> it seems to me that what is under the hood is almost thread safe.

You are correct: the regexp compiler (written in Caml) and the
execution engine (written in C) are thread-safe, it's only the API (in
Caml) that uses global state.

> So why not to write a new module (say Regexp) which would be thread
> safe and on top of which Str would be build?

That's a very good idea.  My initial plan was to have two APIs, the
old Str for compatibility and a newer, better designed one for new
programs.  Besides being thread-safe, the new API could also expose
the abstract syntax tree for regexps, allowing easier construction of
complex regexps by programs than can be done by working at the level
of the string representation of regexps.  It's just that I never got
to design that new API :-(

> This would not be too much work, would it?

The implementation work should be quite small indeed, but don't
underestimate the work needed to design the API.  API design is harder
than it seems...  But if a few persons on this list want to team up to
design an API, that would be wonderful indeed.  (A small group is
better than individual designers in that several pairs of eyes spot
inconsistencies better.)

- Xavier Leroy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-10 15:49 ` Xavier Leroy
@ 2005-01-10 16:39   ` Richard Jones
  2005-01-10 18:21   ` Eric C. Cooper
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Richard Jones @ 2005-01-10 16:39 UTC (permalink / raw)
  Cc: O'Caml Mailing List

On Mon, Jan 10, 2005 at 04:49:17PM +0100, Xavier Leroy wrote:
> The implementation work should be quite small indeed, but don't
> underestimate the work needed to design the API.  API design is harder
> than it seems...  But if a few persons on this list want to team up to
> design an API, that would be wonderful indeed.  (A small group is
> better than individual designers in that several pairs of eyes spot
> inconsistencies better.)

Any chance of syntactic support for regexps in the language?  (As in
Perl and particularly Regexp/OCaml[1], for example).

Rich.

[1] http://web.yl.is.s.u-tokyo.ac.jp/~oiwa/pub/caml/regexp-pp-0.9.5.1/README.match-regexp

-- 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-10 15:49 ` Xavier Leroy
  2005-01-10 16:39   ` Richard Jones
@ 2005-01-10 18:21   ` Eric C. Cooper
  2005-01-10 20:25   ` Martin Jambon
  2005-01-11  6:41   ` Chris King
  3 siblings, 0 replies; 19+ messages in thread
From: Eric C. Cooper @ 2005-01-10 18:21 UTC (permalink / raw)
  To: caml-list, O'Caml Mailing List

On Mon, Jan 10, 2005 at 04:49:17PM +0100, Xavier Leroy wrote:
> My initial plan was to have two APIs, the old Str for compatibility
> and a newer, better designed one for new programs.  Besides being
> thread-safe, the new API could also expose the abstract syntax tree
> for regexps, allowing easier construction of complex regexps by
> programs than can be done by working at the level of the string
> representation of regexps.  It's just that I never got to design
> that new API :-(

The API used by Olin Shivers for regexps in scsh might be a good
starting point.

-- 
Eric C. Cooper          e c c @ c m u . e d u


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-10 15:49 ` Xavier Leroy
  2005-01-10 16:39   ` Richard Jones
  2005-01-10 18:21   ` Eric C. Cooper
@ 2005-01-10 20:25   ` Martin Jambon
  2005-01-11  3:54     ` skaller
  2005-01-11  6:41   ` Chris King
  3 siblings, 1 reply; 19+ messages in thread
From: Martin Jambon @ 2005-01-10 20:25 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: Christophe TROESTLER, O'Caml Mailing List

On Mon, 10 Jan 2005, Xavier Leroy wrote:

> You are correct: the regexp compiler (written in Caml) and the
> execution engine (written in C) are thread-safe, it's only the API (in
> Caml) that uses global state.

Good to know :-)

> The implementation work should be quite small indeed, but don't
> underestimate the work needed to design the API.  API design is harder
> than it seems...  But if a few persons on this list want to team up to
> design an API, that would be wonderful indeed.  (A small group is
> better than individual designers in that several pairs of eyes spot
> inconsistencies better.)

I would be glad to help if I can.

However, my concerns are more about how to integrate regular expressions
in the language so that they can be checked at compile-time just like the
rest of the program. My Camlp4 syntax extension
(http://martin.jambon.free.fr/micmatch.html) works just fine for this
purpose and I am not asking for any change in the core OCaml syntax ;-)


Martin

--
Martin Jambon, PhD
Researcher in Structural Bioinformatics since the 20th Century
The Burnham Institute http://www.burnham.org
San Diego, California


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-10 20:25   ` Martin Jambon
@ 2005-01-11  3:54     ` skaller
  2005-01-11  7:03       ` Chris King
  2005-01-11 20:53       ` Martin Jambon
  0 siblings, 2 replies; 19+ messages in thread
From: skaller @ 2005-01-11  3:54 UTC (permalink / raw)
  To: Martin Jambon; +Cc: Xavier Leroy, Christophe TROESTLER, O'Caml Mailing List

On Tue, 2005-01-11 at 07:25, Martin Jambon wrote:

> However, my concerns are more about how to integrate regular expressions
> in the language so that they can be checked at compile-time just like the
> rest of the program. My Camlp4 syntax extension
> (http://martin.jambon.free.fr/micmatch.html) works just fine for this
> purpose and I am not asking for any change in the core OCaml syntax ;-)

But Str is a hack. Like all NFA based solutions, it's unreliable
because it is unsound: possible infinite recursion, indeterminate
capture results, and exponential performance. 

In addition, regular expressions have poor scalability and
fail to provide simple i18n support.

I think Felix does this the right way: core language support
for regular definitions, linear classification and tokensisation,
and no captures. If you want captures use the proper tool,
namely a parser, which is also supported directly in the language
(this is more problematic due to the number of parsers around,
Felix currently provides the GLR parser Elkhound).

So I would love to see integration of regexp support in Ocaml
but I'm very much against Str. If some technology is to
be integrated, please use the right technology and 
integrate Ocamllex.

See the ulex package for a model. The problem with micmatch
is precisely that it does use Str.

BTW: it probably doesn't work either, due to the bug in Str
I mentioned in an earlier post.

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-11  3:54     ` skaller
@ 2005-01-11  7:03       ` Chris King
  2005-01-11  8:06         ` skaller
  2005-01-11 20:53       ` Martin Jambon
  1 sibling, 1 reply; 19+ messages in thread
From: Chris King @ 2005-01-11  7:03 UTC (permalink / raw)
  To: skaller; +Cc: O'Caml Mailing List

On 11 Jan 2005 14:54:30 +1100, skaller <skaller@users.sourceforge.net> wrote:
> If you want captures use the proper tool, namely a parser,
> [...]
> If some technology is to be integrated, please use the right technology
> and integrate Ocamllex.

Event-based parsers are not the "proper tool" for many applications. 
Why write a lexer and all its necessary event handlers when one can
just write "s/foo(.*)bar/bar\1foo/g"?  Regular expressions were
designed for pattern matching and substitution, and the latter
function is why they have captures.  Lexers were designed for parsing,
not substitution, and thus excel at the former while making the latter
difficult.  Regexps are certainly the wrong tool for parsing, but they
do have their place.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-11  7:03       ` Chris King
@ 2005-01-11  8:06         ` skaller
  2005-01-11 12:08           ` Gerd Stolpmann
       [not found]           ` <875c7e070501111007dc3e86d@mail.gmail.com>
  0 siblings, 2 replies; 19+ messages in thread
From: skaller @ 2005-01-11  8:06 UTC (permalink / raw)
  To: Chris King; +Cc: O'Caml Mailing List

On Tue, 2005-01-11 at 18:03, Chris King wrote:
> On 11 Jan 2005 14:54:30 +1100, skaller <skaller@users.sourceforge.net> wrote:
> > If you want captures use the proper tool, namely a parser,
> > [...]
> > If some technology is to be integrated, please use the right technology
> > and integrate Ocamllex.
> 
> Event-based parsers are not the "proper tool" for many applications. 

What are 'event-based' parsers??

> Why write a lexer and all its necessary event handlers when one can
> just write "s/foo(.*)bar/bar\1foo/g"?  

Just compare a real example from the Alioth Shootout:

Specification:
---------------------------------------------------
The telephone number pattern: 

      * there may be zero or one telephone numbers per line of input
      * a telephone number may start at the beginning of the line or be
        preceeded by a non-digit, (which may be preceeded by anything)
      * it begins with a 3-digit area code that looks like this (DDD) or
        DDD (where D is [0-9])
      * the area code is followed by one space
      * which is followed by the 3 digits of the exchange: DDD
      * the exchange is followed by a space or hyphen [ -]
      * which is followed by the last 4 digits: DDDD
      * which can be followed by end of line or a non-digit (which may
        be followed by anything).

------ FELIX SOLUTION---------------------

regexp digit = ["0123456789"];
regexp digits3 = digit digit digit;
regexp digits4 =  digits3 digit;

regexp area_code = digits3 | "(" digits3 ")";
regexp exchange = digits3;

regexp phone = area_code " " exchange (" " | "-") digits4;

fun lexit (start:iterator, finish:iterator): iterator * string =>
  reglex start to finish with
  | phone => check_context (lexeme_start, lexeme_end)
  | _ => ""
  endmatch
;

--------------- PCRE SOLUTION -------------------------

"(?: ^ | [^\\d\\(])           # must be preceeded by non-digit
(\\(\\d\\d\\d\\)|\\d\\d\\d)  # match 1: area code
[ ]                          # area code followed by one space
\\d\\d\\d                    # prefix of 3 digits
[ -]                         # separator is either space or dash
 \\d\\d\\d\\d                 # last 4 digits
(?: \\D|$)                   # must be followed by a non-digit (or EOL)
"

-------------------------------------------------------

If you think the PCRE solution is in any way better -- well
I'd like to point out it is in fact WRONG. The Felix solution
is obviously correct. As an exercise, find the bug in the
idiotic PCRE solution. Oh yes, it is extremely stupid.
In fact there is a much better solution not using 
the irregular back match crap which tried to be tricky
by matching either empty or ( with empty or ) --
and failed utterly. So as an exercise, find this superior
solution.

The bottom line is: a properly supported syntax for
regular matching is superior even for small regexps.

> Regular expressions were
> designed for pattern matching and substitution,

So called regular expressions are NOT mathematically
regular expressions, and they were certainly NOT designed by
people that knew what they were doing from a theoretical viewpoint.

Regular expressions have a precise mathematical foundation,
and they do NOT include captures.

Attempts to add captures to the theory have been made, 
and none are satisfactory IMHO. Certainly none actually agree 
with any implementations.

PCRE has some wording about longest matches
and left most capture but these semantics turn out to
be inconsistent, and are not what PCRE actually implements.

The problem with parsing to obtain captures is that parsers
have not traditionally been well integrated into any programming
languages (except possibly LISP .. :)

The fact is that 'regexps' are actually lame parsers, 
so it is better to get rid of the irregular crud and provide 
a proper regular expression facility and a proper parser. IMHO.

That way we have solid mathemtical foundations for both subsystems,
and the parsing support is actually vastly more capable than
mere regexps (since you can build parse trees :)

Expressions like:

"s/foo(.*)bar/bar\1foo/g"

may well make more sense in a language like Perl which

(a) is dynamically typed

(b) has strong convenient string handling which 
allows the woeful lack of regular definitions to
be replaced by other features such as interpolation.

(c) the language was never expected to scale up
to programming in the large

Ocaml on the other hand is statically typed, doens't
have strong convenient string handling, and is expected
to scale up to programming in the large.

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-11  8:06         ` skaller
@ 2005-01-11 12:08           ` Gerd Stolpmann
  2005-01-11 17:55             ` skaller
       [not found]           ` <875c7e070501111007dc3e86d@mail.gmail.com>
  1 sibling, 1 reply; 19+ messages in thread
From: Gerd Stolpmann @ 2005-01-11 12:08 UTC (permalink / raw)
  To: skaller; +Cc: Chris King, O'Caml Mailing List

On Die, 2005-01-11 at 09:06, skaller wrote:
> On Tue, 2005-01-11 at 18:03, Chris King wrote:
> > On 11 Jan 2005 14:54:30 +1100, skaller <skaller@users.sourceforge.net> wrote:
> > > If you want captures use the proper tool, namely a parser,
> > > [...]
> > > If some technology is to be integrated, please use the right technology
> > > and integrate Ocamllex.

Which is already done. There is a camlp4 extension allowing one to
define ocamllex-like scanners as part of normal modules (pa_ocamllex).
It is even part of the O'caml distribution, in camlp4/unmaintained.
Well, there is also a maintained and internationalised version called
ulex by the same author.

> > Why write a lexer and all its necessary event handlers when one can
> > just write "s/foo(.*)bar/bar\1foo/g"?  

This is a typical example where one would prefer regexp engines over
lex-type scanners. Just because of simplicity.

> Just compare a real example from the Alioth Shootout:
> 
> Specification:
> ---------------------------------------------------
> The telephone number pattern: 
> 
>       * there may be zero or one telephone numbers per line of input
>       * a telephone number may start at the beginning of the line or be
>         preceeded by a non-digit, (which may be preceeded by anything)
>       * it begins with a 3-digit area code that looks like this (DDD) or
>         DDD (where D is [0-9])
>       * the area code is followed by one space
>       * which is followed by the 3 digits of the exchange: DDD
>       * the exchange is followed by a space or hyphen [ -]
>       * which is followed by the last 4 digits: DDDD
>       * which can be followed by end of line or a non-digit (which may
>         be followed by anything).

This is already a problem near the limit of writing regexps directly, in
one step. I think the problem is not the complexity of the problem, but
the string notation of regexps which makes any solution hard to read.

However, there are better criterions when to use which lexing
technology:

- When one wants to have a token stream as result: use (ocaml)lex
- When one can profit from recursive lexer definitions: use (ocaml)lex
- When the regexp is computed: use regexp engine

> ------ FELIX SOLUTION---------------------
> 
> regexp digit = ["0123456789"];
> regexp digits3 = digit digit digit;
> regexp digits4 =  digits3 digit;
> 
> regexp area_code = digits3 | "(" digits3 ")";
> regexp exchange = digits3;
> 
> regexp phone = area_code " " exchange (" " | "-") digits4;
> 
> fun lexit (start:iterator, finish:iterator): iterator * string =>
>   reglex start to finish with
>   | phone => check_context (lexeme_start, lexeme_end)
>   | _ => ""
>   endmatch
> ;

The pa_ocamllex/ulex solution would be very similar to this.

> > Regular expressions were
> > designed for pattern matching and substitution,
> 
> So called regular expressions are NOT mathematically
> regular expressions, and they were certainly NOT designed by
> people that knew what they were doing from a theoretical viewpoint.
> 
> Regular expressions have a precise mathematical foundation,
> and they do NOT include captures.

You mean back references, e.g.

((a|b|c)*)d\1

here matching strings where the part of the string before "d" is
identical to the part after "d". The set of matched strings is a
context-sensitive language!

Capturing just for the purpose of string extraction is not problematic.
Back references are unsound when they can match the empty word.

> Attempts to add captures to the theory have been made, 
> and none are satisfactory IMHO. Certainly none actually agree 
> with any implementations.

This is true, but I think it is practically irrelevant.

> PCRE has some wording about longest matches
> and left most capture but these semantics turn out to
> be inconsistent, and are not what PCRE actually implements.

There are often several ways of capturing. In practice, I found this
never to be a real problem, because there are almost always alternate
regexps avoiding such problems.

I still think that a regexp engine is a useful addition to a language
like ocaml. Even with camlp4 integration ocamllex is an overkill
solution for many simple matching problems (notation and execution
overhead).

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-11 12:08           ` Gerd Stolpmann
@ 2005-01-11 17:55             ` skaller
  2005-01-11 20:30               ` Gerd Stolpmann
  0 siblings, 1 reply; 19+ messages in thread
From: skaller @ 2005-01-11 17:55 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Chris King, O'Caml Mailing List

On Tue, 2005-01-11 at 23:08, Gerd Stolpmann wrote:

> Capturing just for the purpose of string extraction is not problematic.

Unfortunately you're wrong. I had in fact hoped this was not
the case, but having read a couple of paper on it,
I now know that, unfortunately, they are. (See below).

Alain Frisch is one of the leaders in this area, perhaps
he can explain it better. You can get Cardelli and Frisch
paper here:

http://felix.sf.net/papers/greedy.pdf

another by Ville Laurikari here:

http://felix.sf.net/papers/spire2000-tnfa.ps

and his Master thesis on the topic here:

http://felix.sf.net/papers/regex-submatch.ps

[and if you can figure out how to actually build
a tagged DFA after reading any of that please let
me know .. ]

Frisch's algorithm used in CDuce works with a forward
pass that ignores captures, but records the automaton states,
and then a backwards pass the extracts the information
(CDuce actually builds trees).

Unfortunately this has a problem: it requires a
bidirectional iterator, whereas a DFA only requires
an input iterator. (NFA's require forward iterators)

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-11 17:55             ` skaller
@ 2005-01-11 20:30               ` Gerd Stolpmann
  2005-01-12  7:42                 ` skaller
  0 siblings, 1 reply; 19+ messages in thread
From: Gerd Stolpmann @ 2005-01-11 20:30 UTC (permalink / raw)
  To: skaller; +Cc: Chris King, O'Caml Mailing List

On Die, 2005-01-11 at 18:55, skaller wrote:
> On Tue, 2005-01-11 at 23:08, Gerd Stolpmann wrote:
> 
> 
> > Capturing just for the purpose of string extraction is not problematic.
> 
> Unfortunately you're wrong. I had in fact hoped this was not
> the case, but having read a couple of paper on it,
> I now know that, unfortunately, they are. (See below).

It's a matter of taste. Granted, you cannot define capturing directly
for an automaton, only by a possibly poor backtracking algorithm. (This
means the approach is "definition by algorithm", not really the best way
to do it.)

Which leads to a more serious problem: One must know the algorithm of
the implementation to avoid running into a hard backtracking case (which
can be exponential). There are a lot of tutorials explaining good and
bad regexps. The lex-type scanners do not have this problem (but you may
run into space problems instead because the automaton becomes large).

> Alain Frisch is one of the leaders in this area, perhaps
> he can explain it better. You can get Cardelli and Frisch
> paper here:
> 
> http://felix.sf.net/papers/greedy.pdf

When I understand the direction correctly, this paper explains how to
avoid backtracking by using a two-pass algorithm (automaton plus
expression-based postprocessing to get the captured strings).

> another by Ville Laurikari here:
> 
> http://felix.sf.net/papers/spire2000-tnfa.ps
> 
> and his Master thesis on the topic here:
> 
> http://felix.sf.net/papers/regex-submatch.ps
> 
> [and if you can figure out how to actually build
> a tagged DFA after reading any of that please let
> me know .. ]
> 
> Frisch's algorithm used in CDuce works with a forward
> pass that ignores captures, but records the automaton states,
> and then a backwards pass the extracts the information
> (CDuce actually builds trees).
> 
> Unfortunately this has a problem: it requires a
> bidirectional iterator, whereas a DFA only requires
> an input iterator. (NFA's require forward iterators)

For string regexps this is not a big problem, the string is in memory.
For XML, this is more a problem, especially when the XML document is
only available event by event (processing large documents).

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-11 20:30               ` Gerd Stolpmann
@ 2005-01-12  7:42                 ` skaller
  0 siblings, 0 replies; 19+ messages in thread
From: skaller @ 2005-01-12  7:42 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Chris King, O'Caml Mailing List

On Wed, 2005-01-12 at 07:30, Gerd Stolpmann wrote:

> It's a matter of taste. Granted, you cannot define capturing directly
> for an automaton, only by a possibly poor backtracking algorithm. (This
> means the approach is "definition by algorithm", not really the best way
> to do it.)

It's worse than that ..

> Which leads to a more serious problem: One must know the algorithm of
> the implementation to avoid running into a hard backtracking case (which
> can be exponential). 

.. it's worse .. some cases could be infinite recursion:
see Frisch & Cardelli.

It is known PCRE has this problem. It was hacked to
fix one of the common cases. The hack isn't general.

> There are a lot of tutorials explaining good and
> bad regexps.

Which are worthless junk if you're generating them.. 

The point is -- this is folklore not mathematics.

>  The lex-type scanners do not have this problem (but you may
> run into space problems instead because the automaton becomes large).

Indeed, but they also have another problem: they don't support 
captures at all.

> > 
> > Unfortunately this has a problem: it requires a
> > bidirectional iterator, whereas a DFA only requires
> > an input iterator. (NFA's require forward iterators)
> 
> For string regexps this is not a big problem, the string is in memory.
> For XML, this is more a problem, especially when the XML document is
> only available event by event (processing large documents).

Hehe .. but the main use of the algorithm is in the specialist
XML processing language XDuce...

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net




^ permalink raw reply	[flat|nested] 19+ messages in thread

[parent not found: <875c7e070501111007dc3e86d@mail.gmail.com>]

[parent not found: <1105471138.2574.88.camel@pelican.wigram>]

[parent not found: <875c7e07050111115618692184@mail.gmail.com>]

* Re: [Caml-list] Thread safe Str
       [not found]               ` <875c7e07050111115618692184@mail.gmail.com>
@ 2005-01-11 19:58                 ` Chris King
  0 siblings, 0 replies; 19+ messages in thread
From: Chris King @ 2005-01-11 19:58 UTC (permalink / raw)
  To: O'Caml Mailing List

Forgot to CC this to the list, sorry if anyone gets a dupe....

> I'm sure you'd agree there are separable issues here:
>
> (1) using a string encoding of a regexp as opposed to
> a lex like one -- this has nothing to do with
> captures.

Yes.  As I stated in another e-mail in this thread, I'd love to see an
API that exposes the parse tree of a regexp.

> (2) Captures

I'd like to add (3) Parsing vs. substitution.  You can't effectively
do the latter without captures (of course it can be done but it's
messy).

> The fact that the regexp syntax is not checked statically
> isn't relevant in a dynamic language since the typing
> of the rest of the program isn't either.

I think we're talking about different things... I used the "s//g"
syntax to represent the substitution function in whatever language is
being used, not as an example of something to be compiled.  (regexp
"foo(.*)bar") certainly has a static type, and if it's malformed it
simply raises an exception.

> I'm trying to provide that in Felix. It has Python style literals,
> and Python style substrings. However it is still clumbsy compared
> with Perl (I guess .. I can't write Perl ..)

Perl string mangling is clumsy compared with Python.  The key is that
Python treats strings as arrays (or lists) of characters.

> > True.  Performing multiple replacements on a single string with a
> > regexp is retarded.  But so is writing a lexer for a simple one-shot
> > replacement job.
>
> That depends on how hard it is to write a lexer.

Specifically, a lexer whose input and output are both strings and
which performs substitution.  Not pretty, unless captures are
provided.

> At present, Felix regexps have to be constants. It will be possible
> in the bootstrapped compiler to generate them as text, and then
> compile and link (i.e. there will be a function sort of like 'eval').

That's not acceptable for, say, an incremental search, though, where a
new regexp must be generated on each keystroke. (Yes I know regexps
aren't the best way to go about that but I'm sure there are better
examples.)

> >  My point is just that regexps are useful enough to co-exist with lexers.
>
> But they're the same thing. Lexer provide regular definitions,
> which is just a way of naming regexps, and reusing the regexps
> by providing the name.

Not in the form of lex/flex/ocamllex.  Yes, lexer token definitions
are equivalent to regexps, but everything else about them is different
(specifically, lexers are event-based, and don't provied captures).
I'd love to see a regexp engine allowing dynamic creation of
token-based regexps, complete with captures.  It could easily serve as
both the base of a lexer and a substitution engine.  Heck, that sounds
like a fun project... what I am doing this weekend? :P

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-11  3:54     ` skaller
  2005-01-11  7:03       ` Chris King
@ 2005-01-11 20:53       ` Martin Jambon
  2005-01-12  7:59         ` skaller
  1 sibling, 1 reply; 19+ messages in thread
From: Martin Jambon @ 2005-01-11 20:53 UTC (permalink / raw)
  To: skaller
  Cc: Martin Jambon, Xavier Leroy, Christophe TROESTLER,
	O'Caml Mailing List

On 11 Jan 2005, skaller wrote:

> So I would love to see integration of regexp support in Ocaml
> but I'm very much against Str. If some technology is to
> be integrated, please use the right technology and
> integrate Ocamllex.
>
> See the ulex package for a model. The problem with micmatch
> is precisely that it does use Str.

It uses either Str or PCRE. The PCRE variant makes me feel very
satisfied and as you know, a programmer which feels this way is unlikely
to spend time to improve his tools :-)

You can also integrate ulex by yourself if it is possible.
In that case, it should be relatively straightforward to reuse the
existing micmatch core which is already shared between the 2 existing
variants.


Martin

--
Martin Jambon, PhD
Researcher in Structural Bioinformatics since the 20th Century
The Burnham Institute http://www.burnham.org
San Diego, California



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-11 20:53       ` Martin Jambon
@ 2005-01-12  7:59         ` skaller
  2005-01-12 20:12           ` Martin Jambon
  0 siblings, 1 reply; 19+ messages in thread
From: skaller @ 2005-01-12  7:59 UTC (permalink / raw)
  To: Martin Jambon; +Cc: Xavier Leroy, Christophe TROESTLER, O'Caml Mailing List

On Wed, 2005-01-12 at 07:53, Martin Jambon wrote:
> On 11 Jan 2005, skaller wrote:

> > See the ulex package for a model. The problem with micmatch
> > is precisely that it does use Str.
> 
> It uses either Str or PCRE.

> You can also integrate ulex by yourself if it is possible.

I'm not sure .. ulex doesn't use NFA's does it?
AFIAK it doesn't support captures.

The problem with micmatch is that it probably doesn't do what 
can be done with a system supported in the core language,
linear matching over all cases. Felix does do it: in this code

regmatch .. with
| re1 => ..
| re2 => ..
| re3 =>..
endmatch

the input characters (from ...) are read exactly once,
not only is there no backtracking, since a DFA is used,
but it isn't a sequence of comparisons (it's a DFA based
tokeniser, the token selecting the RHS expression).

I would guess micmatch does not support that, although
it probably could.

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-12  7:59         ` skaller
@ 2005-01-12 20:12           ` Martin Jambon
  0 siblings, 0 replies; 19+ messages in thread
From: Martin Jambon @ 2005-01-12 20:12 UTC (permalink / raw)
  To: skaller; +Cc: O'Caml Mailing List

On 12 Jan 2005, skaller wrote:

> On Wed, 2005-01-12 at 07:53, Martin Jambon wrote:
> > On 11 Jan 2005, skaller wrote:
>
> > > See the ulex package for a model. The problem with micmatch
> > > is precisely that it does use Str.
> >
> > It uses either Str or PCRE.
>
> > You can also integrate ulex by yourself if it is possible.
>
> I'm not sure .. ulex doesn't use NFA's does it?
> AFIAK it doesn't support captures.
>
> The problem with micmatch is that it probably doesn't do what
> can be done with a system supported in the core language,
> linear matching over all cases. Felix does do it: in this code
>
> regmatch .. with
> | re1 => ..
> | re2 => ..
> | re3 =>..
> endmatch
>
> the input characters (from ...) are read exactly once,
> not only is there no backtracking, since a DFA is used,
> but it isn't a sequence of comparisons (it's a DFA based
> tokeniser, the token selecting the RHS expression).
>
> I would guess micmatch does not support that, although
> it probably could.

Sorry, I don't know the details of what can be done with DFA or NFA.
The idea of micmatch is to be compatible with the behavior of the regular
match...with, which allows us to mix both forms of pattern matching
(returning the first match, not the longest match).

The following returns `Case1:
match "abcde", 1 with
   RE "abc", 1 -> `Case1
 | RE "abcd", _ -> `Case2

And the following simplified code still returns `Case1:
match "abcde" with
   RE "abc" -> `Case1
 | RE "abcd" -> `Case2

And similarly this returns "abc":
match "abcde" with
   RE ("abc" | "abcd" as x) -> x



I could implement the following syntax:
match subj with
  RE (re1 as x1 => `Case1 x1 as y) | (re2 as x2 => `Case2 x2 as y) -> y

And the simplified forms:
match subj with
  RE (re1 => `Case1 as y) | (re2 => `Case2 as y) -> y

match subj with
  RE (re1 => f1 ()) | (re2 => f2 ()) -> ()

We could then do sophisticated things like that:
match subj with
    RE ... ((re1 => f1 ()) | (re2 => raise Try_next_please)) ...  -> ...
  | ... (* next *) -> ...

but I don't know if these additions would be a good thing.


Martin

--
Martin Jambon, PhD
Researcher in Structural Bioinformatics since the 20th Century
The Burnham Institute http://www.burnham.org
San Diego, California



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Caml-list] Thread safe Str
  2005-01-10 15:49 ` Xavier Leroy
                     ` (2 preceding siblings ...)
  2005-01-10 20:25   ` Martin Jambon
@ 2005-01-11  6:41   ` Chris King
  3 siblings, 0 replies; 19+ messages in thread
From: Chris King @ 2005-01-11  6:41 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: O'Caml Mailing List

On Mon, 10 Jan 2005 16:49:17 +0100, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
> Besides being thread-safe, the new API could also expose
> the abstract syntax tree for regexps, allowing easier construction of
> complex regexps by programs than can be done by working at the level
> of the string representation of regexps.

Nifty!  This is something I'd really like to see.

> But if a few persons on this list want to team up to
> design an API, that would be wonderful indeed.

I was actually considering writing a wrapper for Str modeled after
Python's object-oriented re library, which I've found to be very
well-designed (much better than any syntax-based solution).  Being
relatively new to OCaml, I'm not sure that classes in the standand
library are considered kosher; would it be preferable to use types
rather than classes?

Also, Python (and IIRC Perl also) support "raw" strings, strings which
do not interpret escape sequences.  These are very useful for avoiding
extra backslash garbage in regular expressions; and (I assume) could
be easily added to OCaml to ease the usage of regexps.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2005-01-12 20:13 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-09 19:30 Thread safe Str Christophe TROESTLER
2005-01-09 20:57 ` [Caml-list] " Gerd Stolpmann
2005-01-10  9:57 ` Alex Baretta
2005-01-10 15:49 ` Xavier Leroy
2005-01-10 16:39   ` Richard Jones
2005-01-10 18:21   ` Eric C. Cooper
2005-01-10 20:25   ` Martin Jambon
2005-01-11  3:54     ` skaller
2005-01-11  7:03       ` Chris King
2005-01-11  8:06         ` skaller
2005-01-11 12:08           ` Gerd Stolpmann
2005-01-11 17:55             ` skaller
2005-01-11 20:30               ` Gerd Stolpmann
2005-01-12  7:42                 ` skaller
     [not found]           ` <875c7e070501111007dc3e86d@mail.gmail.com>
     [not found]             ` <1105471138.2574.88.camel@pelican.wigram>
     [not found]               ` <875c7e07050111115618692184@mail.gmail.com>
2005-01-11 19:58                 ` Chris King
2005-01-11 20:53       ` Martin Jambon
2005-01-12  7:59         ` skaller
2005-01-12 20:12           ` Martin Jambon
2005-01-11  6:41   ` Chris King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox