* ocamllex and python-style indentation @ 2009-06-11 12:57 Andrej Bauer 2009-06-11 13:12 ` [Caml-list] " yoann padioleau ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Andrej Bauer @ 2009-06-11 12:57 UTC (permalink / raw) To: caml-list My parsing powers are not sufficient to easily come up with lexer/parser for a simple language that uses python-style indentation and newline rules. Does anyone have such a thing lying around, written in ocamllex/yacc or menhir? I would appreciate a peek to see how you've dealt with it. For example, suppose we want just a very simple fragment of Python involving True, False, conditional statements, variables, and assignments, such as: if True: x = 3 y = (2 + 4 + 5) else: x = 5 if False: x = 8 z = 2 How would I go about writing a lexer/parser for such a thing in ocaml? Andrej ^ permalink raw reply [flat|nested] 22+ messages in thread
* re: [Caml-list] ocamllex and python-style indentation 2009-06-11 12:57 ocamllex and python-style indentation Andrej Bauer @ 2009-06-11 13:12 ` yoann padioleau 2009-06-11 13:21 ` Andreas Rossberg 2009-06-11 13:44 ` Martin Jambon 2 siblings, 0 replies; 22+ messages in thread From: yoann padioleau @ 2009-06-11 13:12 UTC (permalink / raw) To: Andrej Bauer, caml-list > My parsing powers are not sufficient to easily come up with > lexer/parser for a simple language that uses python-style indentation > and newline rules. Does anyone have such a thing lying around, written > in ocamllex/yacc or menhir? I would appreciate a peek to see how > you've dealt with it. > > For example, suppose we want just a very simple fragment of Python > involving True, False, conditional statements, variables, and > assignments, such as: > > if True: > x = 3 > y = (2 + > 4 + 5) > else: > x = 5 > if False: > x = 8 > z = 2 > > How would I go about writing a lexer/parser for such a thing in ocaml? I guess you could interpose a function between lexing and parsing that works on the tokens and position information of those tokens to insert some new extra tokens that will then make the job of the parser easier. For instance you could add some "virtual" ';' at the end of each line except when you are in a parenthesized stuff. > > Andrej > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-11 12:57 ocamllex and python-style indentation Andrej Bauer 2009-06-11 13:12 ` [Caml-list] " yoann padioleau @ 2009-06-11 13:21 ` Andreas Rossberg 2009-06-11 13:44 ` Martin Jambon 2 siblings, 0 replies; 22+ messages in thread From: Andreas Rossberg @ 2009-06-11 13:21 UTC (permalink / raw) To: Andrej Bauer; +Cc: caml-list On Jun 11, 2009, at 14.57 h, Andrej Bauer wrote: > My parsing powers are not sufficient to easily come up with > lexer/parser for a simple language that uses python-style indentation > and newline rules. Does anyone have such a thing lying around, written > in ocamllex/yacc or menhir? I would appreciate a peek to see how > you've dealt with it. Hi Andrej, I have some _very_ old code (a source-to-source translator that provides Haskell-style syntax for Ocaml itself) that does that: http://www.mpi-sws.org/~rossberg/hocaml-0.12.tgz Hope that helps. Cheers, - Andreas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-11 12:57 ocamllex and python-style indentation Andrej Bauer 2009-06-11 13:12 ` [Caml-list] " yoann padioleau 2009-06-11 13:21 ` Andreas Rossberg @ 2009-06-11 13:44 ` Martin Jambon 2009-06-12 8:20 ` Andrej Bauer 2 siblings, 1 reply; 22+ messages in thread From: Martin Jambon @ 2009-06-11 13:44 UTC (permalink / raw) To: Andrej Bauer; +Cc: caml-list Andrej Bauer wrote: > My parsing powers are not sufficient to easily come up with > lexer/parser for a simple language that uses python-style indentation > and newline rules. Does anyone have such a thing lying around, written > in ocamllex/yacc or menhir? I would appreciate a peek to see how > you've dealt with it. > > For example, suppose we want just a very simple fragment of Python > involving True, False, conditional statements, variables, and > assignments, such as: > > if True: > x = 3 > y = (2 + > 4 + 5) > else: > x = 5 > if False: > x = 8 > z = 2 > > How would I go about writing a lexer/parser for such a thing in ocaml? I would use a first pass that converts the input lines into this imaginary structure: { if True: ; { x = 3 ; y = (2 + ; { 4 + 5) } } ; else: ; { x = 5 ; if False: ; { x = 8 ; z = 2 } } } You could create a generic tool that parses a file into this: type t = Line of loc * string | Block of loc * t list but as suggested by Yoann, the next step should probably be to flatten this into a stream by introducing artificial tokens: type gen_token = Open of loc (* fake "{" *) | Close of loc (* fake "}" *) | Separator of loc (* fake ";" *) | Line of loc * string then parse each Line into a list of tokens and flatten the result into one single token stream: type token = OPEN_BLOCK of loc (* fake "{" *) | CLOSE_BLOCK of loc (* fake "}" *) | SEPARATOR of loc (* fake ";" *) | ... (* your language-specific tokens here *) The token stream could then be processed by ocamlyacc/menhir. That's the approach I would follow if I had to solve this problem again. Martin -- http://mjambon.com/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-11 13:44 ` Martin Jambon @ 2009-06-12 8:20 ` Andrej Bauer 2009-06-12 12:56 ` Martin Jambon ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Andrej Bauer @ 2009-06-12 8:20 UTC (permalink / raw) To: caml-list Thanks to Andreas, I'll have a look at the "old" code. I think I understand the general idea of inserting "virtual" tokens, but the details confuse me still. So starting with > if True: > x = 3 > y = (2 + > 4 + 5) > else: > x = 5 > if False: > x = 8 > z = 2 Martin suggests the following: > { > if True: > ; > { > x = 3 > ; > y = (2 + > ; > { > 4 + 5) > } > } > ; > else: > ; > { > x = 5 > ; > if False: > ; > { > x = 8 > ; > z = 2 > } > } > } I have two questions. Notice that the { ... } and ( ... ) need not be correctly nested (in the top half), so how are we going to deal with this? The second question is, why are there the separators after and just before "else:". I would expect separators inside { .... }, but not around "else". Presumably the intermediate stage that I would preprocess the token stream would have to know about indentation levels. I have not tried this, but ocaml lexer will correctly match things like | '\n' [' ' '\t']* -> { INDENTATION (compute_indentation (lexeme buf)) } Yes? With kind regards, Andrej ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-12 8:20 ` Andrej Bauer @ 2009-06-12 12:56 ` Martin Jambon 2009-06-12 13:34 ` Martin Jambon 2009-06-12 15:43 ` Andreas Rossberg 2 siblings, 0 replies; 22+ messages in thread From: Martin Jambon @ 2009-06-12 12:56 UTC (permalink / raw) To: Andrej Bauer; +Cc: caml-list Andrej Bauer wrote: > Thanks to Andreas, I'll have a look at the "old" code. > > I think I understand the general idea of inserting "virtual" tokens, > but the details confuse me still. So starting with > >> if True: >> x = 3 >> y = (2 + >> 4 + 5) >> else: >> x = 5 >> if False: >> x = 8 >> z = 2 > > Martin suggests the following: > >> { >> if True: >> ; >> { >> x = 3 >> ; >> y = (2 + >> ; >> { >> 4 + 5) >> } >> } >> ; >> else: >> ; >> { >> x = 5 >> ; >> if False: >> ; >> { >> x = 8 >> ; >> z = 2 >> } >> } >> } > > I have two questions. Notice that the { ... } and ( ... ) need not be > correctly nested (in the top half), so how are we going to deal with > this? The second question is, why are there the separators after and > just before "else:". I would expect separators inside { .... }, but > not around "else". Original example: if True: x = 3 y = (2 + 4 + 5) else: x = 5 if False: x = 8 z = 2 For pure indentation concerns, it is equivalent to: x x x x x x x x x Which is parsed into: [ Line; Block [ Line; Line; Block [ Line ] ]; Line; Block [ Line; Line ]; Block [ Line; Line ] ] I wrote the following code, which does the job. You might want to use ocamllex instead in order to better manage newline characters (CRLF...), line number directives and allow input from something else than a file or in_channel. Note that the following must be rejected: x x x (indentation here could be only 0, 4 or more) But this is accepted: x x x x You could also enforce that the indentation of a block must be the current indentation + k, for example k=2 for the whole input. (******************* indent_parser.ml **********************) type indent_line = Lexing.position * (int * string) type indent_tree = [ `Line of (Lexing.position * string) | `Block of (Lexing.position * indent_tree list) ] let split s = let len = String.length s in let result = ref None in try for i = 0 to len - 1 do if s.[i] <> ' ' then ( result := Some (i, String.sub s i (len - i)); raise Exit ) done; None with Exit -> !result let parse_lines fname ic : indent_line list = let lines = ref [] in let lnum = ref 0 in try while true do let bol = pos_in ic in let s = input_line ic in incr lnum; match split s with None -> () | Some ((n, _) as x) -> let pos = { Lexing.pos_fname = fname; pos_lnum = !lnum; pos_bol = bol; pos_cnum = bol + n; } in lines := (pos, x) :: !lines done; assert false with End_of_file -> List.rev !lines let parse_lines_from_file fname = let ic = open_in fname in try let x = parse_lines fname ic in close_in ic; x with e -> close_in_noerr ic; raise e let error pos msg = let cpos = pos.Lexing.pos_cnum - pos.Lexing.pos_bol in let msg = Printf.sprintf "File %S, line %i, characters %i-%i:\n%s" pos.Lexing.pos_fname pos.Lexing.pos_lnum 0 cpos msg in failwith msg let rec block_body cur_indent sub_indent cur_block l : indent_tree list * indent_line list = match l with [] -> (List.rev cur_block, []) | (pos, (n, s)) :: tl -> if n = cur_indent then block_body cur_indent sub_indent (`Line (pos, s) :: cur_block) tl else if n > cur_indent then ( (match sub_indent with None -> () | Some n' -> if n <> n' then error pos "Inconsistent indentation" ); let sub_block, remaining = block_body n None [ `Line (pos, s) ] tl in block_body cur_indent (Some n) (`Block (pos, sub_block) :: cur_block) remaining ) else (List.rev cur_block, l) let parse_indentation fname = let l = parse_lines_from_file fname in let result, remaining = block_body 0 None [] l in assert (remaining = []); result let test () = let fname = Filename.temp_file "test" ".ind" in let oc = open_out fname in output_string oc " if True: x = 3 y = (2 + 4 + 5) else: x = 5 if False: x = 8 z = 2 "; close_out oc; try let result = parse_indentation fname in Sys.remove fname; result with Failure msg as e -> Printf.eprintf "%s\n%!" msg; Sys.remove fname; raise e (*****************************************************************) > Presumably the intermediate stage that I would preprocess the token > stream would have to know about indentation levels. I have not tried > this, but ocaml lexer will correctly match things like > > | '\n' [' ' '\t']* -> { INDENTATION (compute_indentation (lexeme buf)) } > > Yes? Kind of. Don't discard the rest of the line... If you have a choice, reject tabs. Beware of CRLF newlines (\r\n) and missing \n before the end of file. Also ocamllex does not keep track of newlines automatically. See the documentation for Lexing.lexbuf. Martin -- http://mjambon.com/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-12 8:20 ` Andrej Bauer 2009-06-12 12:56 ` Martin Jambon @ 2009-06-12 13:34 ` Martin Jambon 2009-06-12 15:43 ` Andreas Rossberg 2 siblings, 0 replies; 22+ messages in thread From: Martin Jambon @ 2009-06-12 13:34 UTC (permalink / raw) To: Andrej Bauer; +Cc: caml-list Andrej Bauer wrote: > Thanks to Andreas, I'll have a look at the "old" code. > > I think I understand the general idea of inserting "virtual" tokens, > but the details confuse me still. So starting with > >> if True: >> x = 3 >> y = (2 + >> 4 + 5) >> else: >> x = 5 >> if False: >> x = 8 >> z = 2 > > Martin suggests the following: > >> { >> if True: >> ; >> { >> x = 3 >> ; >> y = (2 + >> ; >> { >> 4 + 5) >> } >> } >> ; >> else: >> ; >> { >> x = 5 >> ; >> if False: >> ; >> { >> x = 8 >> ; >> z = 2 >> } >> } >> } > > I have two questions. Notice that the { ... } and ( ... ) need not be > correctly nested (in the top half), so how are we going to deal with > this? It depends on the characteristics of your language. It is generally easier to use several successive passes rather than trying to do everything in one pass. Martin -- http://mjambon.com/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-12 8:20 ` Andrej Bauer 2009-06-12 12:56 ` Martin Jambon 2009-06-12 13:34 ` Martin Jambon @ 2009-06-12 15:43 ` Andreas Rossberg 2009-06-30 18:58 ` Yitzhak Mandelbaum 2 siblings, 1 reply; 22+ messages in thread From: Andreas Rossberg @ 2009-06-12 15:43 UTC (permalink / raw) To: Andrej Bauer; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 2630 bytes --] On Jun 12, 2009, at 10.20 h, Andrej Bauer wrote: > I think I understand the general idea of inserting "virtual" tokens, > but the details confuse me still. So starting with > >> if True: >> x = 3 >> y = (2 + >> 4 + 5) >> else: >> x = 5 >> if False: >> x = 8 >> z = 2 > > Martin suggests the following: > >> { >> if True: >> ; >> { >> x = 3 >> ; >> y = (2 + >> ; >> { >> 4 + 5) >> } >> } >> ; >> else: >> ; >> { >> x = 5 >> ; >> if False: >> ; >> { >> x = 8 >> ; >> z = 2 >> } >> } >> } > > I have two questions. Notice that the { ... } and ( ... ) need not be > correctly nested (in the top half), so how are we going to deal with > this? The second question is, why are there the separators after and > just before "else:". I would expect separators inside { .... }, but > not around "else". It depends on how exactly you define your layout rules. The usual approach is to tie start of layout-sensitive blocks to particular keywords -- this is essentially what Python and Haskell do. In that case, the binding to y is not affected. Haskell's rules for optional layout would rewrite your original program as >> if True: >> {x = 3 >> ;y = (2 + >> 4 + 5) >> }else: >> {x = 5 >> ;if False: >> {x = 8 >> ;z = 2 >> }} The basic rules are fairly simple: 1. Insert "{" (assume width 0) before the first token following a layout keyword (usually ":" in Python). This opens a block. 2. As long as inside a block, insert ";" before each token that is on the _same_ column as the current (i.e. innermost) "{". 3. A block ends as soon as you see a line whose first token is _left_ of the current "{". Insert "}" before that token. Blocks can be nested, so you need to maintain a stack of starting columns in the parser. Note that rule 3 may end several blocks at once. EOF is treated as a token at column 0. The way I implemented this is by wrapping the ocamllex-generated lexer with a function that compares each token's column with the top of the layout stack and inserts auxiliary tokens as necessary. Haskell has another rule for inserting "}" if there would be a parse error without it (this is to allow inline blocks). This rule is pretty fudgy, and almost impossible to implement properly with a conventional parser generator. IMO, the only sane way to reformulate this rule is again to tie it to specific keywords, e.g. insert "}" before "else" if missing. This can be implemented in the parser by making closing braces optional in the right places. - Andreas [-- Attachment #2: Type: text/html, Size: 6455 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-12 15:43 ` Andreas Rossberg @ 2009-06-30 18:58 ` Yitzhak Mandelbaum 2009-06-30 20:19 ` Mike Lin 2009-06-30 22:06 ` Andreas Rossberg 0 siblings, 2 replies; 22+ messages in thread From: Yitzhak Mandelbaum @ 2009-06-30 18:58 UTC (permalink / raw) To: Andreas Rossberg; +Cc: Andrej Bauer, caml-list [-- Attachment #1: Type: text/plain, Size: 3721 bytes --] To restart this thread, do your solutions handle the following (legal) variation of the original example? if True: x = 3+4 y = (2 + 4 + 5) z = 5 else: x = 5 if False: x = 8 z = 2 Notice that the assignment of y wraps onto the next line at an *earlier* column. This is legal b/c it is surrounded by parens. However, it seems that the preprocessing approaches will fail for this example. Do you have a workaround? --Yitzhak On Jun 12, 2009, at 11:43 AM, Andreas Rossberg wrote: > On Jun 12, 2009, at 10.20 h, Andrej Bauer wrote: > >> I think I understand the general idea of inserting "virtual" tokens, >> but the details confuse me still. So starting with >> >>> if True: >>> x = 3 >>> y = (2 + >>> 4 + 5) >>> else: >>> x = 5 >>> if False: >>> x = 8 >>> z = 2 >> >> Martin suggests the following: >> >>> { >>> if True: >>> ; >>> { >>> x = 3 >>> ; >>> y = (2 + >>> ; >>> { >>> 4 + 5) >>> } >>> } >>> ; >>> else: >>> ; >>> { >>> x = 5 >>> ; >>> if False: >>> ; >>> { >>> x = 8 >>> ; >>> z = 2 >>> } >>> } >>> } >> >> I have two questions. Notice that the { ... } and ( ... ) need not be >> correctly nested (in the top half), so how are we going to deal with >> this? The second question is, why are there the separators after and >> just before "else:". I would expect separators inside { .... }, but >> not around "else". > > It depends on how exactly you define your layout rules. The usual > approach is to tie start of layout-sensitive blocks to particular > keywords -- this is essentially what Python and Haskell do. In that > case, the binding to y is not affected. Haskell's rules for optional > layout would rewrite your original program as > >>> if True: >>> {x = 3 >>> ;y = (2 + >>> 4 + 5) >>> }else: >>> {x = 5 >>> ;if False: >>> {x = 8 >>> ;z = 2 >>> }} > > The basic rules are fairly simple: > > 1. Insert "{" (assume width 0) before the first token following a > layout keyword (usually ":" in Python). This opens a block. > > 2. As long as inside a block, insert ";" before each token that is > on the _same_ column as the current (i.e. innermost) "{". > > 3. A block ends as soon as you see a line whose first token is > _left_ of the current "{". Insert "}" before that token. > > Blocks can be nested, so you need to maintain a stack of starting > columns in the parser. Note that rule 3 may end several blocks at > once. EOF is treated as a token at column 0. > > The way I implemented this is by wrapping the ocamllex-generated > lexer with a function that compares each token's column with the top > of the layout stack and inserts auxiliary tokens as necessary. > > Haskell has another rule for inserting "}" if there would be a parse > error without it (this is to allow inline blocks). This rule is > pretty fudgy, and almost impossible to implement properly with a > conventional parser generator. IMO, the only sane way to reformulate > this rule is again to tie it to specific keywords, e.g. insert "}" > before "else" if missing. This can be implemented in the parser by > making closing braces optional in the right places. > > - Andreas > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs -------------------------------------------------- Yitzhak Mandelbaum AT&T Labs - Research http://www.research.att.com/~yitzhak [-- Attachment #2: Type: text/html, Size: 10100 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-30 18:58 ` Yitzhak Mandelbaum @ 2009-06-30 20:19 ` Mike Lin 2009-06-30 22:06 ` Andreas Rossberg 1 sibling, 0 replies; 22+ messages in thread From: Mike Lin @ 2009-06-30 20:19 UTC (permalink / raw) To: Yitzhak Mandelbaum; +Cc: Andreas Rossberg, caml-list, Andrej Bauer More generally, you've got parentheses, comments, and string literals, and you need to know to ignore whitespace within any of those -- and to ignore e.g. an open parenthesis that occurs within a comment, or a close comment that occurs within a string literal. So inevitably you've got to lex and parse at some level to make this work for a practical language. There are still some byzantine cases that ocaml+twt doesn't handle properly, and I think it probably gets pretty close to the minimally complex yet practically usable approach to this. Mike On Tue, Jun 30, 2009 at 2:58 PM, Yitzhak Mandelbaum<yitzhak@research.att.com> wrote: > To restart this thread, do your solutions handle the following (legal) > variation of the original example? > if True: > x = 3+4 > y = (2 + > 4 + 5) > z = 5 > else: > x = 5 > if False: > x = 8 > z = 2 > > Notice that the assignment of y wraps onto the next line at an *earlier* > column. This is legal b/c it is surrounded by parens. However, it seems that > the preprocessing approaches will fail for this example. Do you have a > workaround? > --Yitzhak > > On Jun 12, 2009, at 11:43 AM, Andreas Rossberg wrote: > > On Jun 12, 2009, at 10.20 h, Andrej Bauer wrote: > > I think I understand the general idea of inserting "virtual" tokens, > but the details confuse me still. So starting with > > if True: > > x = 3 > > y = (2 + > > 4 + 5) > > else: > > x = 5 > > if False: > > x = 8 > > z = 2 > > Martin suggests the following: > > { > > if True: > > ; > > { > > x = 3 > > ; > > y = (2 + > > ; > > { > > 4 + 5) > > } > > } > > ; > > else: > > ; > > { > > x = 5 > > ; > > if False: > > ; > > { > > x = 8 > > ; > > z = 2 > > } > > } > > } > > I have two questions. Notice that the { ... } and ( ... ) need not be > correctly nested (in the top half), so how are we going to deal with > this? The second question is, why are there the separators after and > just before "else:". I would expect separators inside { .... }, but > not around "else". > > It depends on how exactly you define your layout rules. The usual approach > is to tie start of layout-sensitive blocks to particular keywords -- this is > essentially what Python and Haskell do. In that case, the binding to y is > not affected. Haskell's rules for optional layout would rewrite your > original program as > > if True: > > {x = 3 > > ;y = (2 + > > 4 + 5) > > }else: > > {x = 5 > > ;if False: > > {x = 8 > > ;z = 2 > > }} > > The basic rules are fairly simple: > 1. Insert "{" (assume width 0) before the first token following a layout > keyword (usually ":" in Python). This opens a block. > 2. As long as inside a block, insert ";" before each token that is on the > _same_ column as the current (i.e. innermost) "{". > 3. A block ends as soon as you see a line whose first token is _left_ of the > current "{". Insert "}" before that token. > Blocks can be nested, so you need to maintain a stack of starting columns in > the parser. Note that rule 3 may end several blocks at once. EOF is treated > as a token at column 0. > The way I implemented this is by wrapping the ocamllex-generated lexer with > a function that compares each token's column with the top of the layout > stack and inserts auxiliary tokens as necessary. > Haskell has another rule for inserting "}" if there would be a parse error > without it (this is to allow inline blocks). This rule is pretty fudgy, and > almost impossible to implement properly with a conventional parser > generator. IMO, the only sane way to reformulate this rule is again to tie > it to specific keywords, e.g. insert "}" before "else" if missing. This can > be implemented in the parser by making closing braces optional in the right > places. > - Andreas > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > > -------------------------------------------------- > Yitzhak Mandelbaum > AT&T Labs - Research > http://www.research.att.com/~yitzhak > > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-30 18:58 ` Yitzhak Mandelbaum 2009-06-30 20:19 ` Mike Lin @ 2009-06-30 22:06 ` Andreas Rossberg 2009-07-01 2:13 ` Mike Lin 1 sibling, 1 reply; 22+ messages in thread From: Andreas Rossberg @ 2009-06-30 22:06 UTC (permalink / raw) To: Yitzhak Mandelbaum, Mike Lin; +Cc: Andrej Bauer, caml-list On Jun 30, 2009, at 20.58 h, Yitzhak Mandelbaum wrote: > To restart this thread, do your solutions handle the following > (legal) variation of the original example? > > if True: > x = 3+4 > y = (2 + > 4 + 5) > z = 5 > else: > x = 5 > if False: > x = 8 > z = 2 > > Notice that the assignment of y wraps onto the next line at an > *earlier* column. This is legal b/c it is surrounded by parens. > However, it seems that the preprocessing approaches will fail for > this example. Do you have a workaround? Handling that wouldn't be hard with the approach I described. You only need to have parentheses start a dummy block at column 0, so that no tokens will be inserted inside it . However, I don't think it is actually a good idea to allow this. (And I don't think Haskell does, but I may misremember.) On Jun 30, 2009, at 22.19 h, Mike Lin wrote: > More generally, you've got parentheses, comments, and string literals, > and you need to know to ignore whitespace within any of those -- and > to ignore e.g. an open parenthesis that occurs within a comment, or a > close comment that occurs within a string literal. So inevitably > you've got to lex and parse at some level to make this work for a > practical language. Comments and string literals are no problem, since the lex wrapper will never see anything inside them as separate tokens anyway. - Andreas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-06-30 22:06 ` Andreas Rossberg @ 2009-07-01 2:13 ` Mike Lin 2009-07-01 7:31 ` Andreas Rossberg 0 siblings, 1 reply; 22+ messages in thread From: Mike Lin @ 2009-07-01 2:13 UTC (permalink / raw) To: Andreas Rossberg; +Cc: Yitzhak Mandelbaum, Andrej Bauer, caml-list On Tue, Jun 30, 2009 at 6:06 PM, Andreas Rossberg<rossberg@mpi-sws.org> wrote: > On Jun 30, 2009, at 22.19 h, Mike Lin wrote: >> >> More generally, you've got parentheses, comments, and string literals, >> and you need to know to ignore whitespace within any of those -- and >> to ignore e.g. an open parenthesis that occurs within a comment, or a >> close comment that occurs within a string literal. So inevitably >> you've got to lex and parse at some level to make this work for a >> practical language. > > Comments and string literals are no problem, since the lex wrapper will > never see anything inside them as separate tokens anyway. OCaml comments can be nested, and must be nested parenthetically. :) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-07-01 2:13 ` Mike Lin @ 2009-07-01 7:31 ` Andreas Rossberg 2009-07-01 14:02 ` Mike Lin 0 siblings, 1 reply; 22+ messages in thread From: Andreas Rossberg @ 2009-07-01 7:31 UTC (permalink / raw) To: Mike Lin; +Cc: Yitzhak Mandelbaum, Andrej Bauer, caml-list [-- Attachment #1: Type: text/plain, Size: 456 bytes --] On Jul 1, 2009, at 04.13 h, Mike Lin wrote: >> >> Comments and string literals are no problem, since the lex wrapper >> will >> never see anything inside them as separate tokens anyway. > > OCaml comments can be nested, and must be nested parenthetically. :) I know. (Haskell has nested comments, too, btw.) That does not make a difference, though - it is all handled by lex already. In fact, my code handled nested comments just fine. - Andreas [-- Attachment #2: Type: text/html, Size: 879 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-07-01 7:31 ` Andreas Rossberg @ 2009-07-01 14:02 ` Mike Lin 2009-07-01 14:17 ` Andreas Rossberg 0 siblings, 1 reply; 22+ messages in thread From: Mike Lin @ 2009-07-01 14:02 UTC (permalink / raw) To: Andreas Rossberg; +Cc: Yitzhak Mandelbaum, Andrej Bauer, caml-list On Wed, Jul 1, 2009 at 3:31 AM, Andreas Rossberg<rossberg@mpi-sws.org> wrote: > > I know. (Haskell has nested comments, too, btw.) That does not make a > difference, though - it is all handled by lex already. In fact, my code > handled nested comments just fine. OK, now I'm curious :) how does your lexer match balanced parentheses, or in this case comments? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-07-01 14:02 ` Mike Lin @ 2009-07-01 14:17 ` Andreas Rossberg 2009-07-01 14:21 ` Andreas Rossberg 2009-07-01 15:03 ` Sylvain Le Gall 0 siblings, 2 replies; 22+ messages in thread From: Andreas Rossberg @ 2009-07-01 14:17 UTC (permalink / raw) To: Mike Lin; +Cc: caml-list Mike Lin wrote: > OK, now I'm curious :) how does your lexer match balanced parentheses, > or in this case comments? > Easily, with a bit of side effects (I think that's roughly how all ML compilers do it): ------------------------------------------------ let error l s = (* ... *) let commentDepth = ref 0 let start = ref 0 let loc length = let pos = !start in (pos, pos+length) rule lex = parse eof { EOF } (* | ... *) | "{-" { start := pos lexbuf; lexNestComment lexbuf } and lexNestComment = parse eof { error (loc 2) "unterminated comment" } | "(*" { incr commentDepth; lexNestComment lexbuf } | "*)" { decr commentDepth; if !commentDepth > 0 then lexNestComment lexbuf else lex lexbuf } | _ { lexNestComment lexbuf } ------------------------------------------------ If you also want to treat strings in comments specially (like OCaml), then you need to do a bit more work, but it's basically the same idea. - Andreas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-07-01 14:17 ` Andreas Rossberg @ 2009-07-01 14:21 ` Andreas Rossberg 2009-07-01 14:37 ` Mike Lin 2009-07-01 15:03 ` Sylvain Le Gall 1 sibling, 1 reply; 22+ messages in thread From: Andreas Rossberg @ 2009-07-01 14:21 UTC (permalink / raw) To: Mike Lin; +Cc: caml-list I wrote: > ------------------------------------------------ > let error l s = (* ... *) > let commentDepth = ref 0 > let start = ref 0 > let loc length = let pos = !start in (pos, pos+length) > > rule lex = > parse eof { EOF } > (* | ... *) > | "{-" { start := pos lexbuf; > lexNestComment lexbuf } Sorry, the "{-" should be "(*" for ML. And the (* | ... *) is supposed to stand for all other tokens in your lexer. - Andreas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] ocamllex and python-style indentation 2009-07-01 14:21 ` Andreas Rossberg @ 2009-07-01 14:37 ` Mike Lin 0 siblings, 0 replies; 22+ messages in thread From: Mike Lin @ 2009-07-01 14:37 UTC (permalink / raw) To: Andreas Rossberg; +Cc: caml-list That's cheating!!!!!! It's clever :) But clearly, you have a little parser with a stack there. My original contribution to this thread was just that "you've got to lex and parse at some level" before doing the whitespace conversion. Not that you were claiming otherwise, but it seemed like others had alluded to simpler ways which would only work for toy languages. On Wed, Jul 1, 2009 at 10:21 AM, Andreas Rossberg<rossberg@mpi-sws.org> wrote: > I wrote: >> >> ------------------------------------------------ >> let error l s = (* ... *) >> let commentDepth = ref 0 >> let start = ref 0 >> let loc length = let pos = !start in (pos, pos+length) >> >> rule lex = >> parse eof { EOF } >> (* | ... *) >> | "{-" { start := pos lexbuf; >> lexNestComment lexbuf } > > Sorry, the "{-" should be "(*" for ML. And the (* | ... *) is supposed to > stand for all other tokens in your lexer. > > - Andreas > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: ocamllex and python-style indentation 2009-07-01 14:17 ` Andreas Rossberg 2009-07-01 14:21 ` Andreas Rossberg @ 2009-07-01 15:03 ` Sylvain Le Gall 2009-07-01 15:16 ` [Caml-list] " Andreas Rossberg 2009-07-01 15:19 ` [Caml-list] " Martin Jambon 1 sibling, 2 replies; 22+ messages in thread From: Sylvain Le Gall @ 2009-07-01 15:03 UTC (permalink / raw) To: caml-list Hello, On 01-07-2009, Andreas Rossberg <rossberg@mpi-sws.org> wrote: > Mike Lin wrote: >> OK, now I'm curious :) how does your lexer match balanced parentheses, >> or in this case comments? >> > > Easily, with a bit of side effects (I think that's roughly how all ML > compilers do it): > > ------------------------------------------------ > let error l s = (* ... *) > let commentDepth = ref 0 > let start = ref 0 > let loc length = let pos = !start in (pos, pos+length) > > rule lex = > parse eof { EOF } > (* | ... *) > | "{-" { start := pos lexbuf; > lexNestComment lexbuf } > > and lexNestComment = > parse eof { error (loc 2) "unterminated comment" } > | "(*" { incr commentDepth; > lexNestComment lexbuf } > | "*)" { decr commentDepth; > if !commentDepth > 0 > then lexNestComment lexbuf > else lex lexbuf } > | _ { lexNestComment lexbuf } > ------------------------------------------------ > > If you also want to treat strings in comments specially (like OCaml), > then you need to do a bit more work, but it's basically the same idea. > May I recommend you to write this in a more simple way: ------------------------------------------------------------------------- rule lex = parse eof { () } | "(*" { start := pos lexbuf; lexNestComment lexbuf; lex lexbuf } and lexNestComment = parse eof { error (loc 2) "unterminated comment" } | "(*" { lexNestComment lexbuf } | "*)" { () } | _ { lexNestComment lexbuf } ------------------------------------------------------------------------- I think it works the same way, except that it uses less global variables. Regards, Sylvain Le Gall ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] Re: ocamllex and python-style indentation 2009-07-01 15:03 ` Sylvain Le Gall @ 2009-07-01 15:16 ` Andreas Rossberg 2009-07-01 16:26 ` Sylvain Le Gall 2009-07-01 15:19 ` [Caml-list] " Martin Jambon 1 sibling, 1 reply; 22+ messages in thread From: Andreas Rossberg @ 2009-07-01 15:16 UTC (permalink / raw) To: Sylvain Le Gall; +Cc: caml-list Sylvain Le Gall wrote: > May I recommend you to write this in a more simple way: > > ------------------------------------------------------------------------- > rule lex = > parse eof { () } > | "(*" { start := pos lexbuf; lexNestComment lexbuf; lex lexbuf } > > and lexNestComment = > parse eof { error (loc 2) "unterminated comment" } > | "(*" { lexNestComment lexbuf } > | "*)" { () } > | _ { lexNestComment lexbuf } > ------------------------------------------------------------------------- > Mh, I think in lexNestComment it should at least say | "(*" { lexNestComment lexbuf; lexNestComment lexbuf } That might work. I am not sure how well the various lexer generators handle arbitrary recursive invocations and reentrance, though. Have you tried it? If it works, yes, that's certainly a nicer version. - Andreas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: ocamllex and python-style indentation 2009-07-01 15:16 ` [Caml-list] " Andreas Rossberg @ 2009-07-01 16:26 ` Sylvain Le Gall 0 siblings, 0 replies; 22+ messages in thread From: Sylvain Le Gall @ 2009-07-01 16:26 UTC (permalink / raw) To: caml-list On 01-07-2009, Andreas Rossberg <rossberg@mpi-sws.org> wrote: > Sylvain Le Gall wrote: >> May I recommend you to write this in a more simple way: >> >> ------------------------------------------------------------------------- >> rule lex = >> parse eof { () } >> | "(*" { start := pos lexbuf; lexNestComment lexbuf; lex lexbuf } >> >> and lexNestComment = >> parse eof { error (loc 2) "unterminated comment" } >> | "(*" { lexNestComment lexbuf } >> | "*)" { () } >> | _ { lexNestComment lexbuf } >> ------------------------------------------------------------------------- >> > > Mh, I think in lexNestComment it should at least say > > | "(*" { lexNestComment lexbuf; lexNestComment lexbuf } > > > That might work. I am not sure how well the various lexer generators > handle arbitrary recursive invocations and reentrance, though. Have you > tried it? If it works, yes, that's certainly a nicer version. > Yes, you're right, here it is: ------------------------------------------------------------------------ { let start = ref 0 let error l s = failwith "toto" let loc length = let pos = !start in (pos, pos+length) let pos _ = 0 } rule lex = parse eof { () } | "(*" { start := pos lexbuf; lexNestComment lexbuf; lex lexbuf } | _ { print_string (Lexing.lexeme lexbuf); lex lexbuf } and lexNestComment = parse eof { error (loc 2) "unterminated comment" } | "(*" { lexNestComment lexbuf; lexNestComment lexbuf } | "*)" { () } | _ { lexNestComment lexbuf } { let chn = open_in "nested.ml" in lex (Lexing.from_channel chn); close_in chn } ------------------------------------------------------------------------ and nested.ml: ----------------------------------- (* Comment 1 *) "Comment1 ok";; (* Commenbt 2 (* test *) *) "Comment 2 ok";; (* totot ... (* (* *) (**) *)*) "Comment 3 ok";; ----------------------------------- and the result: ---------------------------------------------------- ocamllex lexer.mll 9 states, 260 transitions, table size 1094 bytes ocamlc -o lexer lexer.ml ./lexer "Comment1 ok";; "Comment 2 ok";; "Comment 3 ok";; ----------------------------------------------------- It works ;-) Regards, Sylvain Le Gall ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] Re: ocamllex and python-style indentation 2009-07-01 15:03 ` Sylvain Le Gall 2009-07-01 15:16 ` [Caml-list] " Andreas Rossberg @ 2009-07-01 15:19 ` Martin Jambon 2009-07-01 15:43 ` Andreas Rossberg 1 sibling, 1 reply; 22+ messages in thread From: Martin Jambon @ 2009-07-01 15:19 UTC (permalink / raw) To: Sylvain Le Gall; +Cc: caml-list Sylvain Le Gall wrote: > Hello, > > On 01-07-2009, Andreas Rossberg <rossberg@mpi-sws.org> wrote: >> Mike Lin wrote: >>> OK, now I'm curious :) how does your lexer match balanced parentheses, >>> or in this case comments? >>> >> Easily, with a bit of side effects (I think that's roughly how all ML >> compilers do it): >> >> ------------------------------------------------ >> let error l s = (* ... *) >> let commentDepth = ref 0 >> let start = ref 0 >> let loc length = let pos = !start in (pos, pos+length) >> >> rule lex = >> parse eof { EOF } >> (* | ... *) >> | "{-" { start := pos lexbuf; >> lexNestComment lexbuf } >> >> and lexNestComment = >> parse eof { error (loc 2) "unterminated comment" } >> | "(*" { incr commentDepth; >> lexNestComment lexbuf } >> | "*)" { decr commentDepth; >> if !commentDepth > 0 >> then lexNestComment lexbuf >> else lex lexbuf } >> | _ { lexNestComment lexbuf } >> ------------------------------------------------ >> >> If you also want to treat strings in comments specially (like OCaml), >> then you need to do a bit more work, but it's basically the same idea. >> > > May I recommend you to write this in a more simple way: > > ------------------------------------------------------------------------- > rule lex = > parse eof { () } > | "(*" { start := pos lexbuf; lexNestComment lexbuf; lex lexbuf } > > and lexNestComment = > parse eof { error (loc 2) "unterminated comment" } > | "(*" { lexNestComment lexbuf } > | "*)" { () } > | _ { lexNestComment lexbuf } > ------------------------------------------------------------------------- > > I think it works the same way, except that it uses less global > variables. You can even get rid of global variables completely: rule lex x = parse eof { () } | "(*" { x.start <- pos lexbuf; lexNestComment x lexbuf; lex x lexbuf } and lexNestComment x = parse eof { error (loc x 2) "unterminated comment" } | "(*" { lexNestComment x lexbuf } | "*)" { () } | _ { lexNestComment x lexbuf } Martin -- http://mjambon.com/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] Re: ocamllex and python-style indentation 2009-07-01 15:19 ` [Caml-list] " Martin Jambon @ 2009-07-01 15:43 ` Andreas Rossberg 0 siblings, 0 replies; 22+ messages in thread From: Andreas Rossberg @ 2009-07-01 15:43 UTC (permalink / raw) To: Martin Jambon; +Cc: Sylvain Le Gall, caml-list Martin Jambon wrote: > You can even get rid of global variables completely: > > > rule lex x = parse > eof { () } > | "(*" { x.start <- pos lexbuf; lexNestComment x lexbuf; lex x lexbuf } > > and lexNestComment x = parse > eof { error (loc x 2) "unterminated comment" } > | "(*" { lexNestComment x lexbuf } > | "*)" { () } > | _ { lexNestComment x lexbuf } > Nicer indeed. To my defense I have to say that I wrote the code long before rule arguments were introduced into ocamllex - I said it was really old... :-) - Andreas ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2009-07-01 16:27 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-06-11 12:57 ocamllex and python-style indentation Andrej Bauer 2009-06-11 13:12 ` [Caml-list] " yoann padioleau 2009-06-11 13:21 ` Andreas Rossberg 2009-06-11 13:44 ` Martin Jambon 2009-06-12 8:20 ` Andrej Bauer 2009-06-12 12:56 ` Martin Jambon 2009-06-12 13:34 ` Martin Jambon 2009-06-12 15:43 ` Andreas Rossberg 2009-06-30 18:58 ` Yitzhak Mandelbaum 2009-06-30 20:19 ` Mike Lin 2009-06-30 22:06 ` Andreas Rossberg 2009-07-01 2:13 ` Mike Lin 2009-07-01 7:31 ` Andreas Rossberg 2009-07-01 14:02 ` Mike Lin 2009-07-01 14:17 ` Andreas Rossberg 2009-07-01 14:21 ` Andreas Rossberg 2009-07-01 14:37 ` Mike Lin 2009-07-01 15:03 ` Sylvain Le Gall 2009-07-01 15:16 ` [Caml-list] " Andreas Rossberg 2009-07-01 16:26 ` Sylvain Le Gall 2009-07-01 15:19 ` [Caml-list] " Martin Jambon 2009-07-01 15:43 ` Andreas Rossberg
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox