* [Caml-list] Some sugar for regexp matching using camlp4
From: Francois Pottier @ 2001-07-16 15:54 UTC
To: caml-list
Hello all,
I have experimented a bit with custom syntax for regular expression
matching. My goal was to implement some high-level constructs on top
of a low-level regexp library such as PCRE. The result of my (modest)
experiment is attached. It is a camlp4 grammar extension, which allows
writing

  extract x, y, ... matching e against r in e'

The semantics is as follows. The expression e is evaluated, yielding
a string which is matched against the regular expression r. r must be
either a constant string, or a compiled regular expression; if the
former, pre-compilation code is inserted transparently. The variables
x, y, ... etc. are then bound to the appropriate groups (i.e. x is
bound to the sub-string which matched the whole pattern, y is bound
to the sub-string which matched the first group, etc.) and can be
referred to within e'. Wildcards _ can be used instead of variables.
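
As a rough illustration of these semantics, the sketch below writes out by hand approximately what one use of the construct could expand to, going by the attached pcreg.ml. The function name parse_binding, the pattern, and the sample input are hypothetical; the global name _regexp_1 follows the attachment's naming scheme. It assumes the pcre-ocaml library is available and linked.

```ocaml
(* Sugared form (needs the syntax extension):

     let parse_binding line =
       extract _, key, value matching line against "(\\w+)=(\\w+)" in
       (key, value)

   Hand-written approximation of the code the extension generates. *)

(* The constant pattern is compiled once, in a global declaration. *)
let _regexp_1 = Pcre.regexp "(\\w+)=(\\w+)"

let parse_binding line =
  (* Pcre.exec raises Not_found if [line] does not match. *)
  let _substrings = Pcre.exec ~rex:_regexp_1 line in
  (* Three patterns were given: the whole match plus two groups. *)
  assert (Pcre.num_of_subs _substrings = 3);
  (* The wildcard in position 0 produces no binding. *)
  let value = Pcre.get_substring _substrings 2 in
  let key = Pcre.get_substring _substrings 1 in
  (key, value)
```

Here the first pattern is a wildcard, so the sub-string matching the whole pattern is simply not bound.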
This is of course pretty modest, but it seems that, with a small
number of such constructs, O'Caml could be turned into a rather nice
textual manipulation language. (Something often requested on this
list.) Opinions and further suggestions are welcome.
--
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/
Here's how to use the syntax extension:

1. Compile it:

     ocamlc -pp "camlp4o -I `camlp4o -where`" -I `camlp4o -where` -c pcreg.ml

2. At the beginning of your source files, insert

     #load "pcreg.cmo";;

3. Compile your source files using the following option:

     -pp "camlp4o -I ."

   (in addition to any options necessary to include the PCRE library,
   e.g. -I +contrib).

[-- Attachment #2: pcreg.ml --]
[-- Type: text/plain, Size: 5147 bytes --]
(* $Header: /net/pauillac/caml/repository/bigbro/pcreg.ml,v 1.1 2001/07/16 15:04:04 fpottier Exp $ *)
open Pcaml
#load "pa_extend.cmo";;
#load "q_MLast.cmo";;
(* ----------------------------------------------------------------------------------------------------------------- *)
(* We begin with an internal utility: a global variable generator, which can be called within grammar rules.
The global variables receive names numbered in a linear fashion. There is a possibility of name clashes
if another module, which uses the same name generator, is ``opened'' and that module does not have a
[.mli] file. It is recommended to always use [.mli] files to describe module interfaces, so these
internal variable names will not be exported. *)
(* This global variable is used to accumulate global variable declarations while the parser is running. *)

let globals =
  ref []

(* This function allows registering a new global declaration. It can be called within a grammar rule. *)

let declare (item : MLast.str_item) =
  globals := (item, (0, 0) (* dummy location *)) :: !globals

(* This function is used to generate a fresh identifier. *)

let generate =
  let count = ref 0 in
  fun () ->
    incr count;
    Printf.sprintf "_regexp_%d" !count

(* This hook, which is called once per implementation file, adds the global declarations generated by calls
   to [declare] at the beginning of the module. *)

let _ = EXTEND
  implem: FIRST
    [[ (sil, stopped) = NEXT ->
      let extra = !globals in
      globals := [];
      (extra @ sil, stopped)
    ]];
END

(* ----------------------------------------------------------------------------------------------------------------- *)
(* This auxiliary function allows generating code for assertions.
[assert] is dealt with as a kind of special-purpose syntax extension in O'Caml. However, code in quotations must
be expressed in plain (righteous) syntax, which means that it cannot use [assert] directly. Hence, we must use
this code (taken from [camlp4]'s [pa_o.ml]) to generate assertions.
Note that the generated code depends on the value of [camlp4]'s [-noassert] option. This option is distinct
from [ocaml]'s own [-noassert] option. *)
let make_assert loc e =
  let f = <:expr< $str:!Pcaml.input_file$ >> in
  let bp = <:expr< $int:string_of_int (fst loc)$ >> in
  let ep = <:expr< $int:string_of_int (snd loc)$ >> in
  let raiser = <:expr< raise (Assert_failure ($f$, $bp$, $ep$)) >> in
  if !Pcaml.no_assert
  then <:expr< () >>
  else <:expr< if $e$ then () else $raiser$ >>

(* ----------------------------------------------------------------------------------------------------------------- *)
(* We continue with syntactic extensions which allow dealing with regular expressions easily.
The syntax
extract s0, s1, ..., sk matching e against r in e'
evaluates the expression [e], matches its value against the regular expression [r] using [Pcre.exec], and binds the
substrings thus obtained to the patterns [s0], [s1], ..., [sk]. (Each [si] must be either a variable or the
wildcard pattern [_].) [Pcre.exec] raises [Not_found] if it doesn't match. The code also contains a dynamic check
(using [assert]) which ensures that the number of extracted substrings, namely $k+1$, is consistent with the
supplied regular expression. Lastly, the expression [r] must be either a string constant, or a compiled regular
expression. If the former, the string is pre-compiled (using a global declaration) into a regular expression. *)
let _ = EXTEND
  GLOBAL: expr;

  expr: LEVEL "expr1"
    [[ (p, e, r, l) = [ "extract"; p = LIST1 simplepat SEP ","; "matching"; e = expr; "against"; r = expr ->
                        (p, e, r, loc) ]; (* anonymous sub-rule allows extracting partial location [l] *)
       "in"; body = expr LEVEL "top" ->

       (* If the regular expression is a string constant, generate pre-compilation code for it. *)

       let r = match r with
       | <:expr< $str:s$ >> ->
           let name = generate() in
           declare <:str_item< value $lid:name$ = Pcre.regexp $str:s$ >>;
           <:expr< $lid:name$ >>
       | _ ->
           r in

       (* Wrap bindings for the substrings around the declaration's body. *)

       let body, _ = List.fold_left (fun (body, index) name ->
         begin
           match name with
           | Some name ->
               <:expr<
                 let $lid:name$ = Pcre.get_substring _substrings $int:(string_of_int index)$ in
                 $body$
               >>
           | None ->
               body
         end, index + 1
       ) (body, 0) p in

       (* Wrap a dynamic check around the code thus obtained, to ensure that the number of substrings
          extracted out of the pattern is correct. *)

       let condition = <:expr< Pcre.num_of_subs _substrings = $int:(string_of_int (List.length p))$ >> in
       let assertion = make_assert l condition in
       let body = <:expr<
         do {
           $assertion$;
           $body$
         }
       >> in

       (* Wrap the actual pattern matching instruction around the code thus obtained. *)

       <:expr<
         let _substrings = Pcre.exec ~rex:$r$ $e$ in
         $body$
       >>

    ]]
  ;

  simplepat:
    [[ x = LIDENT -> Some x
     | "_" -> None ]]
  ;

END
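
To make the effect of the List.fold_left in the rule above easier to see, here is a small model of it that builds source text instead of camlp4 expressions. wrap_bindings is a hypothetical helper and is not part of pcreg.ml; it mirrors only the shape of the fold, using plain strings where the original uses quotations.

```ocaml
(* [patterns] plays the role of the list produced by [simplepat]:
   [Some name] for a variable, [None] for the wildcard. *)
let wrap_bindings (patterns : string option list) (body : string) : string =
  let wrapped, _ =
    List.fold_left
      (fun (body, index) pattern ->
        let body =
          match pattern with
          | Some name ->
              (* Wrap one more [let] around what has been built so far. *)
              Printf.sprintf
                "let %s = Pcre.get_substring _substrings %d in\n%s"
                name index body
          | None ->
              (* A wildcard adds no binding but still consumes an index. *)
              body
        in
        (body, index + 1))
      (body, 0) patterns
  in
  wrapped

let () =
  print_endline (wrap_bindings [ None; Some "key"; Some "value" ] "(key, value)")
```

Because each step wraps the previous result, the binding for the last pattern ends up outermost. The order is harmless, since the bindings do not depend on one another.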
* Re: [Caml-list] Some sugar for regexp matching using camlp4
From: Alexander V. Voinov @ 2001-07-16 17:37 UTC
To: Francois.Pottier; +Cc: caml-list
Hi Francois,
Francois Pottier wrote:
> extract x, y, ... matching e against r in e'
> This is of course pretty modest, but it seems that, with a small
> number of such constructs, O'Caml could be turned into a rather nice
> textual manipulation language. (Something often requested on this
> list.) Opinions and further suggestions are welcome.
That would be great. A first question about the announcement itself (I haven't
played with the extension yet): what does it do when the match fails? Does it
raise an exception?
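
According to the comments in the attached pcreg.ml, the generated code calls Pcre.exec, which raises Not_found when the string does not match, so that exception propagates out of the extract expression. A caller who prefers an option value could wrap the match along the following lines. This is only a sketch: match_opt is a hypothetical helper, and the String.index call merely stands in for a real matcher so that the example runs without PCRE.

```ocaml
(* Run a matcher that signals failure with Not_found, as Pcre.exec does,
   and turn that failure into None. *)
let match_opt (exec : string -> 'a) (subject : string) : 'a option =
  try Some (exec subject) with Not_found -> None

(* Stand-in matcher (assumption): position of the first '=' in the string.
   With pcre-ocaml one would pass something like [Pcre.exec ~rex] instead. *)
let find_equals (s : string) : int = String.index s '='

let () =
  match match_opt find_equals "key=value" with
  | Some position -> Printf.printf "matched at %d\n" position
  | None -> print_endline "no match"
```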
Alexander
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
* Re: [Caml-list] Some sugar for regexp matching using camlp4
From: Markus Mottl @ 2001-07-17 10:36 UTC
To: Francois Pottier; +Cc: caml-list
On Mon, 16 Jul 2001, Francois Pottier wrote:
> I have experimented a bit with custom syntax for regular expression
> matching. My goal was to implement some high-level constructs on top
> of a low-level regexp library such as PCRE. The result of my (modest)
> experiment is attached. It is a camlp4 grammar extension, which allows
> writing
Nice! This example could surely be used to build a convenient
special-purpose language for text manipulation.
> extract x, y, ... matching e against r in e'
>
> The semantics is as follows. The expression e is evaluated, yielding
> a string which is matched against the regular expression r. r must be
> either a constant string, or a compiled regular expression; if the
> former, pre-compilation code is inserted transparently.
Note that it should be possible to assert the required number of subgroups
at compile-time if the user supplied a constant string: you'd only have
to compile the pattern string to a regexp within the camlp4-rule and
check things there. This would even allow you to catch illegal patterns:
static typing for regular expressions :-)
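
One way to picture such a preprocessing-time check is sketched below. Note that it uses a different and much cruder technique than compiling the pattern with PCRE: it counts capturing groups syntactically, which is only an approximation (for instance, it treats every "(?" as non-capturing, so named groups are missed, and it does not handle a literal "]" at the start of a character class). The names count_groups and arity_ok are hypothetical.

```ocaml
(* Approximate the number of capturing groups in a PCRE-style pattern by
   counting unescaped '(' outside character classes, not followed by '?'. *)
let count_groups (pattern : string) : int =
  let n = String.length pattern in
  let rec scan i in_class count =
    if i >= n then count
    else
      match pattern.[i] with
      | '\\' -> scan (i + 2) in_class count (* skip the escaped character *)
      | '[' when not in_class -> scan (i + 1) true count
      | ']' when in_class -> scan (i + 1) false count
      | '(' when not in_class ->
          let capturing = i + 1 >= n || pattern.[i + 1] <> '?' in
          scan (i + 1) in_class (if capturing then count + 1 else count)
      | _ -> scan (i + 1) in_class count
  in
  scan 0 false 0

(* The construct binds the whole match plus one sub-string per group, so
   [patterns] variables or wildcards require [patterns - 1] groups. *)
let arity_ok ~(patterns : int) (pattern : string) : bool =
  count_groups pattern + 1 = patterns

let () =
  if not (arity_ok ~patterns:3 "(\\w+)=(\\w+)") then
    prerr_endline "extract: wrong number of patterns for this regexp"
```

A grammar rule could call something like arity_ok when r is a string constant and report an error instead of emitting the run-time assertion. Detecting illegal patterns, as suggested above, would still require the real regexp compiler.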
Best regards,
Markus Mottl
--
Markus Mottl markus@oefai.at
Austrian Research Institute
for Artificial Intelligence http://www.oefai.at/~markus
* Re: [Caml-list] Some sugar for regexp matching using camlp4
From: Francois Pottier @ 2001-07-17 12:15 UTC
To: Markus Mottl; +Cc: caml-list
> Note that it should be possible to assert the required number of subgroups
> at compile-time if the user supplied a constant string: you'd only have
> to compile the pattern string to a regexp within the camlp4-rule and
> check things there.
Sounds good, except this would require building a custom version of camlp4,
because it can't dynamically load the Pcre library (as far as I can tell).
--
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/
* Re: [Caml-list] Some sugar for regexp matching using camlp4
From: Markus Mottl @ 2001-07-17 12:39 UTC
To: Francois Pottier; +Cc: caml-list
On Tue, 17 Jul 2001, Francois Pottier wrote:
> > Note that it should be possible to assert the required number of subgroups
> > at compile-time if the user supplied a constant string: you'd only have
> > to compile the pattern string to a regexp within the camlp4-rule and
> > check things there.
>
> Sounds good, except this would require building a custom version of camlp4,
> because it can't dynamically load the Pcre library (as far as I can tell).
I am not a camlp4-guru, but if I am not mistaken, such extensions should
be quite straightforward. If users have to preprocess their files anyway,
they are probably indifferent to whether their preprocessor is the
"plain vanilla" one or not.
Maybe Daniel could tell us how to implement this extension with the least
effort?
Regards,
Markus Mottl
--
Markus Mottl markus@oefai.at
Austrian Research Institute
for Artificial Intelligence http://www.oefai.at/~markus
* Re: [Caml-list] Some sugar for regexp matching using camlp4
From: Michel Schinz @ 2001-07-17 11:45 UTC
To: caml-list
Francois Pottier <Francois.Pottier@inria.fr> writes:
> Hello all,
[...]
> This is of course pretty modest, but it seems that, with a small
> number of such constructs, O'Caml could be turned into a rather nice
> textual manipulation language. (Something often requested on this
> list.) Opinions and further suggestions are welcome.
You might want to look at scsh[1] (my standard suggestion for this
list, it seems). The construct you implemented also exists in scsh,
under the name "let-match" (see page 134 of the scsh manual [2]). Many
other constructs are supported, like "if-match" (similar to let-match
but with a clause to be evaluated when the regular expression does not
match), "match-cond" (tries several regular expressions until one
matches) and so on.
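
For readers who do not know scsh, the general shape of "if-match" and "match-cond" can be approximated in OCaml with ordinary higher-order functions, without any syntax extension. The sketch below is not a translation of the scsh forms; the type matcher and the functions if_match, match_cond and prefix are hypothetical, and prefix is a stand-in matcher used so that the example runs without a regexp library.

```ocaml
(* A matcher returns the matched sub-strings (whole match first), or raises
   Not_found, which mirrors the behaviour of Pcre.exec. *)
type matcher = string -> string array

(* Like "if-match": run [matched] on the groups, or [otherwise] on failure.
   Only a Not_found raised by the matcher itself is intercepted. *)
let if_match (m : matcher) (s : string) ~matched ~otherwise =
  match m s with
  | groups -> matched groups
  | exception Not_found -> otherwise ()

(* Like "match-cond": try each (matcher, action) clause in turn. *)
let rec match_cond clauses ~default s =
  match clauses with
  | [] -> default ()
  | (m, action) :: rest ->
      if_match m s ~matched:action
        ~otherwise:(fun () -> match_cond rest ~default s)

(* Stand-in matcher (assumption): succeeds when [s] starts with [p] and
   returns the prefix and the remainder. *)
let prefix (p : string) : matcher = fun s ->
  let n = String.length p in
  if String.length s >= n && String.sub s 0 n = p
  then [| p; String.sub s n (String.length s - n) |]
  else raise Not_found

let describe (line : string) : string =
  match_cond
    [ (prefix "GET ", fun g -> "read " ^ g.(1));
      (prefix "PUT ", fun g -> "write " ^ g.(1)) ]
    ~default:(fun () -> "unknown request")
    line
```

With a real library, each matcher would presumably wrap a compiled regular expression; a syntax extension in the style of "extract" could then bind the groups to names instead of passing an array.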
Also very interesting in scsh is the sexp-based notation for regular
expressions (pages 112-... of [2]).
[1] http://www.swiss.ai.mit.edu/ftpdir/scsh/
and http://sourceforge.net/projects/scsh/
[2] ftp://ftp-swiss.ai.mit.edu/pub/su/scsh/scsh-manual.ps
Michel.