From: skaller <skaller@users.sourceforge.net>
To: Janne Hellsten <jjhellst@gmail.com>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Matching start of input in lexer created with ocamllex
Date: Fri, 06 Apr 2007 15:52:35 +1000 [thread overview]
Message-ID: <1175838755.6045.90.camel@rosella.wigram> (raw)
In-Reply-To: <700d600f0704051358g294d540ey96a546f7f57a08e7@mail.gmail.com>
On Thu, 2007-04-05 at 23:58 +0300, Janne Hellsten wrote:
> I do have the latter construct in my lexer. How does the above help
> me match the beginning of line in the rule
Ok, sorry .. lets try this:
rule headlex =
| "!for" nonalpha { Some FOR }
| { None }
rule taillex = ...
and now use
let make_lexer () =
let atstart = ref true in
fun lexer ->
if !atstart then begin
atstart := false
match headlex lexbuf with
| None ->
lexbuf.curr_pos <- lexbuf.curr_pos -1;
taillex lexbuf
| Some FOR -> FOR
end else taillex lexbuf
let lexer = make_lexer () in
This is an animal, because you have to 'pushback' an
unwanted character following the "for" keyword.
This will cease working if the implementation of lexbuf
changes, in particular if the record is made private.
I may also have got the details on this wrong .. it's
an undocumented feature, you'll need to read lexing.ml
to figure out if it works.
The problem I'm trying to solve here is that you want
rulename lexbuf
to return a single correct token. This is a requirement
of the parser.
In Felix I don't bother with that. Instead, I just return
a pre-token, I gather the whole input as a list, and then
I write filters that translate the list of pre-tokens into
tokens. Then I make a 'lexer' function that just returns
the next token in the list already prepared, and uses
a dummy lexbuf to satisfy the broken ocamlyacc interface.
In other words, I control invert by returning a list and then
actively rewriting it. For example I would do this,
assuming you have multiple lines that can start with "for":
let strip_white toks = filter (fun tok -> tok = WHITE) toks
let append_eol toks = NEWLINE :: toks
let fixup_for toks =
let aux out inp = match inp with
| NEWLINE :: BANG :: IDENT "for" :: tail -> aux (FOR :: out) tail
| NEWLINE :: tail -> aux out tail
| head :: tail -> aux (head::out) tail
| [] -> rev out
in aux [] toks
let tokens = fixup_for (append_eol (strip_white toks))
The point is this process has arbitrary lookahead.
The downside is you have to hold the whole input in memory.
This is not going to be a problem unless it's the encoding
of an exported database.
The tagged automata underlying the lexer can probably be modified
to allow forward context checks, eg to match stuff like:
"!for" / nonalpha
where / denotes lookahead (but I'm not sure). I know TRE can
do this, however ocamllex is using a prebuilt tagged
deterministic automaton, whereas TRE uses a lazily
constructed one.
The technique basically matches
"!for" (TAG) nonalpha
first, but sets the end location to the TAG position. I believe this
might be possible because the automata ALREADY looks ahead and
backtracks to the last success .. so we need to keep two 'success'
points to do this, the real one and the lookahead one. Normally
they're the same, but in this case they're not. The lexer engine
uses the lookahead one until it is finished, but then fudges
its result to the real one.
The two pointers are needed because in general in
regexp1 / regexp2
we have regexp2 'success' point to track. If you allow
regexp1 / (regexp2 / regexp3)
you might need more pointers (not sure), so this would probably
have to be banned (not sure).
--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
next prev parent reply other threads:[~2007-04-06 5:52 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-04-05 14:37 Janne Hellsten
2007-04-05 19:55 ` [Caml-list] " skaller
2007-04-05 20:58 ` Janne Hellsten
2007-04-06 5:52 ` skaller [this message]
[not found] <20070405205804.90509BC76@yquem.inria.fr>
2007-04-06 7:40 ` David Allsopp
2007-04-06 20:14 ` Janne Hellsten
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1175838755.6045.90.camel@rosella.wigram \
--to=skaller@users.sourceforge.net \
--cc=caml-list@inria.fr \
--cc=jjhellst@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox