* Parallelized parsing
From: Jon Harrop @ 2009-04-20 21:15 UTC
To: caml-list

I'm desperately trying to prepare for the imminent drop of a rock-solid multicore-friendly OCaml implementation and was wondering what work has been done on parallelized parsers and/or parallel-friendly grammars?

For example, Mathematica syntax for nested lists of integers looks like:

  {{{1, 2}}, {{3, 4}, {4, 5}}, ..}

and there are obvious divide-and-conquer approaches to lexing and parsing that grammar. You can recursively subdivide the string (e.g. memory mapped from a file) to build a tree of where the tokens { , and } appear by index and then recursively convert the tree into an AST.

What other grammars can be lexed and/or parsed efficiently in parallel?

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
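The divide-and-conquer idea described above can be sketched in OCaml as follows. This is a minimal illustration under stated assumptions, not the implementation referred to in the message: the names `ast`, `trim`, `split_top`, `parse` and `parse_string` are invented here, the input is assumed to be well formed with spaces as the only whitespace, and the `map` argument is a hook where a parallel map could be substituted for `List.map`. For brevity the sketch fuses the index-tree pass and the AST conversion: each list is split at its depth-0 commas and the resulting slices are parsed independently.

```ocaml
(* AST for nested lists of integers such as {{1, 2}, {3}}.
   Constructor names are hypothetical. *)
type ast = Num of int | Nested of ast list

(* Shrink the half-open range [lo, hi) to exclude surrounding spaces. *)
let rec trim s lo hi =
  if lo < hi && s.[lo] = ' ' then trim s (lo + 1) hi
  else if lo < hi && s.[hi - 1] = ' ' then trim s lo (hi - 1)
  else (lo, hi)

(* Split [lo, hi) at the commas found at brace depth 0.
   Returns (start, stop) index pairs, stop exclusive. *)
let split_top s lo hi =
  let rec go i depth start acc =
    if i >= hi then List.rev ((start, i) :: acc)
    else
      match s.[i] with
      | '{' -> go (i + 1) (depth + 1) start acc
      | '}' -> go (i + 1) (depth - 1) start acc
      | ',' when depth = 0 -> go (i + 1) depth (i + 1) ((start, i) :: acc)
      | _ -> go (i + 1) depth start acc
  in
  go lo 0 lo []

(* Parse the slice [lo, hi) of s. The slices produced by split_top do not
   overlap, so the calls made through map are independent of each other;
   this is where a parallel map could be plugged in. Assumes well-formed
   input: a slice is either one braced list or one integer. *)
let rec parse map s lo hi =
  let lo, hi = trim s lo hi in
  if lo < hi && s.[lo] = '{' && s.[hi - 1] = '}' then begin
    let a, b = trim s (lo + 1) (hi - 1) in
    if a = b then Nested []
    else Nested (map (fun (i, j) -> parse map s i j) (split_top s a b))
  end
  else Num (int_of_string (String.sub s lo (hi - lo)))

(* Sequential entry point; List.map stands in for a parallel map. *)
let parse_string s = parse List.map s 0 (String.length s)
```

Note that `split_top` rescans the text at every nesting level, so this sketch does more work than a single pass that records token positions once, as the message suggests; it is meant to show the shape of the recursion rather than to be fast.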
* Re: [Caml-list] Parallelized parsing
From: Mike Lin @ 2009-04-20 21:35 UTC
To: caml-list

There is certainly a reasonable body of basic CS research on parallelizing CFG algorithms such as CYK, the Earley parser, and to a lesser extent the more practical LALR strategy used by yacc etc. (In the latter case it seems to get easier if you're willing to trade off determinism when parsing ambiguous grammars.)

I know some people who use some of this stuff in very specific contexts (RNA folding), but I haven't seen any practical general-purpose tools like a parallel yacc...

Overall, I don't actually know much more than you could figure out from Google Scholar in an hour but hopefully these were some useful search terms.

On Mon, Apr 20, 2009 at 5:15 PM, Jon Harrop <jon@ffconsultancy.com> wrote:
>
> I'm desperately trying to prepare for the imminent drop of a rock-solid multicore-friendly OCaml implementation and was wondering what work has been done on parallelized parsers and/or parallel-friendly grammars?
>
> For example, Mathematica syntax for nested lists of integers looks like:
>
> {{{1, 2}}, {{3, 4}, {4, 5}}, ..}
>
> and there are obvious divide-and-conquer approaches to lexing and parsing that grammar. You can recursively subdivide the string (e.g. memory mapped from a file) to build a tree of where the tokens { , and } appear by index and then recursively convert the tree into an AST.
>
> What other grammars can be lexed and/or parsed efficiently in parallel?
>
> --
> Dr Jon Harrop, Flying Frog Consultancy Ltd.
> http://www.ffconsultancy.com/?e
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
* Re: [Caml-list] Parallelized parsing
From: Yitzhak Mandelbaum @ 2009-04-21 0:52 UTC
To: Mike Lin; +Cc: caml-list

Unfortunately, most forms of parsing are not terribly amenable to efficient parallelization because of the irregular nature of the subcomponents of the parsing problem. That is, you can't easily break up the problem into subcomponents that can be farmed out to different CPUs. That said, if you've just got CPUs lying around unused anyhow, and wasting resources isn't that important, then there is plenty of work on this topic which might be applicable. Some good places to start are here:

  A. Nijholt. Parallel approaches to context-free language parsing. Chapter 2 in: Parallel Natural Language Processing, U. Hahn and G. Adriaens (eds.), Ablex Publishing Corporation, Norwood, New Jersey, 1994, 135-167 (ISBN 0.89391.869.5).

and here:

  Survey of Parallel Context-Free Parsing Techniques, M. P. van Lohuizen, 1997.

In case you're really interested in the topic, here's a more complete list of references with assorted notes (my own). References below are taken from the above survey.

* **Parallel Natural Language Processing**. A book on parallel parsing algorithms. By Geert Adriaens, Udo Hahn. [Available in Google Books](http://books.google.com/books?id=G9-67_mQPnkC). Some relevant chapters follow.
* YO94: nonterminal-per-processor. Akinori Yonezawa and Ichiro Ohsawa. Object-oriented parallel parsing for context-free grammars. In Adriaens and Hahn.
* Fan94: "Connectionist" parsing. Good for massively parallel machines. The survey comments that it is not promising for CF parsing. Mark Fanty. Context-free parsing in connectionist networks. In Adriaens and Hahn.
* Sik93b: A cross-breeding of Tomita and Earley. Klaas Sikkel. **Parsing Schemata**. PhD thesis. Dept of Computer Science, University of Twente, Enschede, The Netherlands.
* There's a book-chapter preprint on parsing schemata which is related to the above. There's also a book of the same name, which is a revised version of the thesis.
* GC88: MIMD shared-memory Earley. Ralph Grishman, Mahesh Chitrao. Evaluation of a parallel chart parser ([citeseer](http://citeseer.comp.nus.edu.sg/579833.html)).
* dV93b: measurements on parallel implementations of CYK, Earley and DD. J.P.M. de Vreught. A practical comparison between parallel tabular recognizers.
* CF84, Sij86, Tan83: VLSI Earley.
* More VLSI Earley: **A Parallel Parsing VLSI Architecture for Arbitrary Context Free Grammars**. Andreas Koulouris, Nectarios Koziris, Theodore Andronikos, George Papakonstantinou, Panayotis Tsanakas. 1998.
* HdV91, IPS91: load balancing approaches to chart parsing. J. Hoogerbrugge and J.P.M. de Vreught. **Parallel recognizing in practice**. Ibarra, Pong, and Sohn, **Parallel recognition and parsing on the hyper-cube**. IEEE Transactions on Computers, 40(6):764-770, 1991.
* The proposal for Berkeley's new parallel computing center (sponsored by Intel & MS) mentions the need for parallel parsing of web pages. I don't know what, if any, progress they've made in that direction.
* More parallel Earley. Includes a description of the algorithm (basically, bottom-up Earley) together with proofs relating to running time and communication time. Also reports results from running an implementation on a parallel-machine simulator. **A parallel parsing algorithm for arbitrary context-free grammars**. Dong-Yul Ra and Jong-Hyun Kim.
* A Static Load-Balancing Scheme for Parallel XML Parsing on Multicore CPUs. Yinfei Pan, Wei Lu, Ying Zhang, Kenneth Chiu. A paper on parallel XML parsing. I've seen a few of these. This one is representative.
* A paper on linear algebra on GPUs (http://www.cs.utexas.edu/users/flame/pubs/sc08.pdf). Possibly relevant because chart parsing can be implemented as a form of matrix multiply.

Yitzhak

-----------------------------
Yitzhak Mandelbaum
* Re: [Caml-list] Parallelized parsing
From: Jon Harrop @ 2009-04-21 15:55 UTC
To: caml-list

On Tuesday 21 April 2009 01:52:40 Yitzhak Mandelbaum wrote:
> Some good places to start are here:
> ...

Wow! Thanks for the link fest. :-)

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
* Polymorphism problem
From: Eliot Handelman @ 2009-04-21 1:44 UTC
Cc: caml-list

Hi list,

Consider this:

  type 'a x = { x_v : 'a }

  and 'a y = { y_x : int kind;
               y_arr : 'a array
             }
  and 'a kind =
      X of 'a x
    | Y of 'a y

I'd like to write a function _getter_ that's polymorphic over kind. This doesn't work, getting int kind -> int:

  let rec getter = function
      X x -> x.x_v
    | Y y -> y.y_arr.(getter y.y_x)

which seems surprising to me since getter y.y_x is an intermediate value that's never returned. Is this just a limitation of the type system or does this result make sense?

Here's my workaround:

  let rec int_getter = function
      X x -> x.x_v
    | Y y -> (int_getter y.y_x)

  let rec getter = function
      X x -> x.x_v
    | Y y -> y.y_arr.(int_getter y.y_x)

where now the type of getter is 'a kind -> 'a as needed. I have no choice but to use this at present -- is there a better method?

thanks for wisdom,

-- eliot
* Re: [Caml-list] Polymorphism problem
From: Mauricio Fernandez @ 2009-04-21 8:50 UTC
To: eliot, caml-list

On Mon, Apr 20, 2009 at 09:44:54PM -0400, Eliot Handelman wrote:
> Consider this:
>
> type 'a x = { x_v : 'a }
>
> and 'a y = { y_x : int kind;
>              y_arr : 'a array
>            }
> and 'a kind =
>     X of 'a x
>   | Y of 'a y
>
>
> I'd like to write a function _getter_ that's polymorphic over kind. This
> doesn't work, getting int kind -> int:
>
> let rec getter = function
>     X x -> x.x_v
>   | Y y -> y.y_arr.(getter y.y_x)
>
> which seems surprising to me since getter y.y_x is an intermediate
> value that's never returned. Is this just a limitation of the type system or
> does this result make sense?

The above function requires polymorphic recursion, which OCaml doesn't support directly. There are several ways to encode it, though, one involving recursive modules and another rank-2 polymorphism:

  # module rec M : sig val getter : 'a kind -> 'a end =
    struct
      let getter = function
          X x -> x.x_v
        | Y y -> y.y_arr.(M.getter y.y_x)
    end;;
  module rec M : sig val getter : 'a kind -> 'a end
  # M.getter;;
  - : 'a kind -> 'a = <fun>

  # type get = { get : 'a. 'a kind -> 'a };;
  type get = { get : 'a. 'a kind -> 'a; }
  # let rec get =
      { get = function
            X x -> x.x_v
          | Y y -> y.y_arr.(get.get y.y_x) };;
  val get : get = {get = <fun>}
  # let getter = get.get;;
  val getter : 'a kind -> 'a = <fun>

--
Mauricio Fernandez - http://eigenclass.org
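For readers on later compilers: OCaml 3.12 and newer, which postdate this thread, accept an explicitly polymorphic annotation on a `let rec` binding, and that permits polymorphic recursion without either encoding. A minimal sketch, reusing the type definitions from the original question; the sample values `idx` and `strs` are made up for illustration and do not appear in the thread:

```ocaml
type 'a x = { x_v : 'a }
and 'a y = { y_x : int kind; y_arr : 'a array }
and 'a kind = X of 'a x | Y of 'a y

(* The "'a." prefix declares getter polymorphic inside its own body,
   so the recursive call may be used at type int kind -> int while the
   function as a whole keeps the type 'a kind -> 'a. *)
let rec getter : 'a. 'a kind -> 'a = function
  | X x -> x.x_v
  | Y y -> y.y_arr.(getter y.y_x)

(* Hypothetical sample values, not from the thread. *)
let idx : int kind = X { x_v = 1 }
let strs : string kind = Y { y_x = idx; y_arr = [| "a"; "b"; "c" |] }

(* getter idx yields the index 1, which selects "b" from the array. *)
let () = assert (getter strs = "b")
```

The recursive-module and record-field encodings shown in the message remain valid; the annotation is simply the shortest route where the compiler version allows it.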
* Re: [Caml-list] Parallelized parsing
From: David MENTRE @ 2009-04-21 7:19 UTC
To: Jon Harrop; +Cc: caml-list

Hello Jon,

On Mon, Apr 20, 2009 at 23:15, Jon Harrop <jon@ffconsultancy.com> wrote:
> For example, Mathematica syntax for nested lists of integers looks like:
>
> {{{1, 2}}, {{3, 4}, {4, 5}}, ..}
>
> and there are obvious divide-and-conquer approaches to lexing and parsing that
> grammar. You can recursively subdivide the string (e.g. memory mapped from a
> file) to build a tree of where the tokens { , and } appear by index and then
> recursively convert the tree into an AST.
>
> What other grammars can be lexed and/or parsed efficiently in parallel?

Is it of any use? The overhead of parsing a single file in parallel is so high that you won't have any speedup, especially compared to the much simpler approach of parsing *several* files in parallel.

It reminds me of parallel approaches used for 3D movies: a lot of research has been done to parallelize the rendering of a single picture[1] while companies like Pixar are using a much simpler approach in real life: render a whole picture per computer or core.

And don't forget Amdahl's law: http://en.wikipedia.org/wiki/Amdahl%27s_law

Where is your real bottleneck?

Yours,
david

[1] Hopefully, some of those algorithms have brought speedups in a serialized setting, i.e. on a single core or computer, for example by optimizing cache use.
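The Amdahl's law point can be made concrete with a few lines of OCaml. This is only an illustrative sketch: the function name `amdahl_speedup` and the sample fractions are assumptions, not measurements from this thread. If a fraction p of the run time parallelizes perfectly over n cores and the rest stays serial, the overall speedup is 1 / ((1 - p) + p / n).

```ocaml
(* Amdahl's law: p is the parallelizable fraction of the run time
   (between 0 and 1), n is the number of cores. *)
let amdahl_speedup ~p ~n =
  1.0 /. ((1.0 -. p) +. p /. float_of_int n)

(* Print the predicted speedup on 8 cores for a few hypothetical
   parallel fractions. *)
let () =
  List.iter
    (fun p ->
      Printf.printf "p = %.2f: speedup on 8 cores = %.2f\n" p
        (amdahl_speedup ~p ~n:8))
    [ 0.50; 0.90; 0.99 ]
```

With p = 0.5 the speedup can never reach 2 however many cores are used, which is why the question of where the real bottleneck lies matters.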
* Re: [Caml-list] Parallelized parsing
From: Jon Harrop @ 2009-04-21 16:04 UTC
To: David MENTRE, caml-list

On Tuesday 21 April 2009 08:19:33 you wrote:
> On Mon, Apr 20, 2009 at 23:15, Jon Harrop <jon@ffconsultancy.com> wrote:
> > For example, Mathematica syntax for nested lists of integers looks like:
> >
> > {{{1, 2}}, {{3, 4}, {4, 5}}, ..}
> >
> > and there are obvious divide-and-conquer approaches to lexing and parsing
> > that grammar. You can recursively subdivide the string (e.g. memory
> > mapped from a file) to build a tree of where the tokens { , and } appear
> > by index and then recursively convert the tree into an AST.
> >
> > What other grammars can be lexed and/or parsed efficiently in parallel?
>
> Is it of any use? The overhead of parsing a single file in parallel is
> so high that you won't have any speedup, especially compared to the
> much simpler approach of parsing *several* files in parallel.

I'm seeing near-linear speedups parsing nested Mathematica lists in F# using the algorithms I described. Moreover, dumping large quantities of data from Mathematica in that format is much easier than dividing it into separate files because there is no clear break in the tree.

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e