* ocamllex regexp problem
@ 2008-03-19 2:03 Jake Donham
2008-03-19 9:00 ` [Caml-list] " Michael Wohlwend
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Jake Donham @ 2008-03-19 2:03 UTC (permalink / raw)
To: caml-list
Hi list,
I am trying to parse an RSS feed using OCaml-RSS, which uses XML-Light,
which however does not support CDATA blocks. So I added support in the
ocamllex-based lexer as follows:
let ends_sq = [^']']* ']'
let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'
or expanded:
let ends_sq_sq_ang = (([^']']*']') ([^']'] ([^']']*']'))* ']'+) ([^'>']
(([^']']*']') ([^']'] ([^']']*']'))* ']'+))* '>'
rule token = parse
[...]
| "<![CDATA[" (ends_sq_sq_ang as data)
[...]
Here ends_sq_sq_ang is supposed to match strings ending in ]]> which may
contain ] and >. If I give it an input like "foo]]]>bar]]>" (note the extra
square bracket after foo), ocamllex matches the whole input instead of just
"foo]]]>" as I would expect. But Micmatch, when given the same regexp, does
the right thing. (The ']'+ bits are supposed to handle the "]]]>" case.)
I have probably done something stupid and am embarrassing myself by
advertising it to the list, but I did check it carefully. Any idea why this
doesn't work? Thanks,
Jake
* Re: [Caml-list] ocamllex regexp problem
2008-03-19 2:03 ocamllex regexp problem Jake Donham
@ 2008-03-19 9:00 ` Michael Wohlwend
2008-03-19 15:21 ` Martin Jambon
2008-03-19 16:39 ` Jake Donham
2 siblings, 0 replies; 4+ messages in thread
From: Michael Wohlwend @ 2008-03-19 9:00 UTC (permalink / raw)
To: caml-list
On Wednesday, 19 March 2008 03:03:25, Jake Donham wrote:
> Hi list,
> rule token = parse
> [...]
I think the longest-match rule eats the whole input.
Maybe you want:
rule token = parse
[...]
| "<![CDATA[" { cdata lexbuf }
[...]
and cdata = shortest
| (_* as d) "]]>" { Printf.printf "found data:'%s'\n" d; }
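
A plain-OCaml sketch of what the shortest-match rule above is meant to
capture: everything up to the first "]]>". The names find_cdata_end and
cdata_body are made up for illustration and are not part of ocamllex or
XML-Light; the sketch assumes the caller has already consumed "<![CDATA[".

```ocaml
(* Hypothetical helper: index of the first "]]>" at or after [start]. *)
let find_cdata_end (s : string) (start : int) : int option =
  let n = String.length s in
  let rec go i =
    if i + 3 > n then None
    else if s.[i] = ']' && s.[i + 1] = ']' && s.[i + 2] = '>' then Some i
    else go (i + 1)
  in
  go start

(* Hypothetical helper: the CDATA body, without the closing "]]>". *)
let cdata_body (s : string) (start : int) : string option =
  match find_cdata_end s start with
  | Some i -> Some (String.sub s start (i - start))
  | None -> None
```

On the input from the original post, cdata_body "foo]]]>bar]]>" 0 gives
Some "foo]": the scan stops at the first "]]>" and keeps the extra "]".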
Michael
* Re: [Caml-list] ocamllex regexp problem
2008-03-19 2:03 ocamllex regexp problem Jake Donham
2008-03-19 9:00 ` [Caml-list] " Michael Wohlwend
@ 2008-03-19 15:21 ` Martin Jambon
2008-03-19 16:39 ` Jake Donham
2 siblings, 0 replies; 4+ messages in thread
From: Martin Jambon @ 2008-03-19 15:21 UTC (permalink / raw)
To: Jake Donham; +Cc: caml-list
On Tue, 18 Mar 2008, Jake Donham wrote:
> Hi list,
>
> I am trying to parse an RSS feed using OCaml-RSS, which uses XML-Light,
> which however does not support CDATA blocks. So I added support in the
> ocamllex-based lexer as follows:
>
> let ends_sq = [^']']* ']'
> let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
> let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'
>
> or expanded:
>
> let ends_sq_sq_ang = (([^']']*']') ([^']'] ([^']']*']'))* ']'+) ([^'>']
> (([^']']*']') ([^']'] ([^']']*']'))* ']'+))* '>'
>
> rule token = parse
> [...]
> | "<![CDATA[" (ends_sq_sq_ang as data)
> [...]
>
> Here ends_sq_sq_ang is supposed to match strings ending in ]]> which may
> contain ] and >. If I give it an input like "foo]]]>bar]]>" (note the extra
> square bracket after foo), ocamllex matches the whole input instead of just
> "foo]]]>" as I would expect. But Micmatch, when given the same regexp, does
> the right thing. (The ']'+ bits are supposed to handle the "]]]>" case.)
>
> I have probably done something stupid and am embarrassing myself by
> advertising it to the list, but I did check it carefully. Any idea why this
> doesn't work? Thanks,
It's interesting. Note that both matches are valid for this regexp.
Using "shortest" instead of "parse" returns the shorter match for this
particular example. That may solve your problem.
In general, I find it hard to predict which match will come up first
when complex backtracking is involved, regardless of the theory.
My advice would be to use PCRE (from micmatch) for line-oriented parsing,
taking advantage of lazy quantifiers and assertions, and ocamllex when
end-of-lines are insignificant and things are nicely nested.
If it's not so simple, try to make several passes, possibly starting by
discovering blocks based on indentation and then parse each block
afterwards using another technique.
When in addition you have to extract the most out of your data even if
some syntax errors are present, it gets hard. When you must tolerate these
errors exactly in the same way as an existing dominant implementation
(such as Mediawiki), it tends to become impossible.
Martin
--
http://wink.com/profile/mjambon
http://martin.jambon.free.fr
* Re: ocamllex regexp problem
2008-03-19 2:03 ocamllex regexp problem Jake Donham
2008-03-19 9:00 ` [Caml-list] " Michael Wohlwend
2008-03-19 15:21 ` Martin Jambon
@ 2008-03-19 16:39 ` Jake Donham
2 siblings, 0 replies; 4+ messages in thread
From: Jake Donham @ 2008-03-19 16:39 UTC (permalink / raw)
To: caml-list
On Tue, Mar 18, 2008 at 7:03 PM, Jake Donham <jake.donham@skydeck.com>
wrote:
> let ends_sq = [^']']* ']'
> let ends_sq_sq = ends_sq ([^']'] ends_sq)* ']'+
> let ends_sq_sq_ang = ends_sq_sq ([^'>'] ends_sq_sq)* '>'
My colleague Haoyang Wang points out that my regexp, when viewed
nondeterministically, matches "foo]]]>bar]]>", since ']'+ may match only
"]]", then [^'>'] matches the third "]". Changing it to [^'>'']'] repairs
it. So I guess the answer is that Micmatch on PCRE treats the regexp as
greedy, while ocamllex does not.
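
To check the nondeterministic reading by hand, here is a small sketch of a
matcher that lists every position where a regexp can stop. The re type and
the helper names are hypothetical and far simpler than what ocamllex or PCRE
do internally; only ends_sq, ends_sq_sq and the two variants of
ends_sq_sq_ang mirror the regexps quoted above.

```ocaml
(* Tiny regexp AST, for illustration only. *)
type re =
  | Chr of char        (* one literal character *)
  | Not of char list   (* any one character not in the list *)
  | Seq of re * re     (* concatenation *)
  | Star of re         (* zero or more repetitions *)

let plus r = Seq (r, Star r)

(* Every end position reachable by matching [r] against [s] from [i]. *)
let rec ends (r : re) (s : string) (i : int) : int list =
  match r with
  | Chr c -> if i < String.length s && s.[i] = c then [i + 1] else []
  | Not cs ->
      if i < String.length s && not (List.mem s.[i] cs) then [i + 1] else []
  | Seq (a, b) -> List.concat (List.map (fun j -> ends b s j) (ends a s i))
  | Star a ->
      (* zero repetitions, or one non-empty repetition followed by more *)
      i :: List.concat
             (List.map (fun j -> if j > i then ends r s j else []) (ends a s i))

let ends_sq = Seq (Star (Not [']']), Chr ']')
let ends_sq_sq =
  Seq (ends_sq, Seq (Star (Seq (Not [']'], ends_sq)), plus (Chr ']')))

(* [sep] is the set excluded between two ends_sq_sq pieces. *)
let ends_sq_sq_ang sep =
  Seq (ends_sq_sq, Seq (Star (Seq (Not sep, ends_sq_sq)), Chr '>'))

let original = ends_sq_sq_ang ['>']        (* [^'>'] *)
let repaired = ends_sq_sq_ang ['>'; ']']   (* [^'>'']'] *)

(* Sorted, duplicate-free list of possible match lengths from offset 0. *)
let match_ends r s = List.sort_uniq compare (ends r s 0)
```

On "foo]]]>bar]]>", match_ends original yields [7; 13], so a lexer that takes
the longest match consumes all 13 characters. match_ends repaired yields only
[7], i.e. "foo]]]>".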
Thanks to those who replied,
Jake