* [Caml-list] Announcement: PXP 1.1.92 (development version)
From: Gerd Stolpmann @ 2002-09-01  1:45 UTC
To: caml-list

Hi list,

there is a new development version of PXP: 1.1.92.

This version focuses on cleaning up the way lexers are generated. There is
a new tool, lexpp, that generates the lexers from only five files.

Furthermore, many more 8-bit character sets are now supported as internal
encodings. In previous versions of PXP, the internal representation of the
XML trees was restricted to either UTF-8 or ISO-8859-1. Now, a number of
additional encodings are supported, including the whole ISO-8859 series.

Bugfix: If the processing instruction <?xml...?> occurs in the middle of
the XML document, version 1.1.91 immediately stops parsing and ignores the
rest of the file. This is now fixed.

The new version can be found at the usual place:
http://www.ocaml-programming.de/packages/pxp-1.1.92.tar.gz

Gerd
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners

* Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
From: John Max Skaller @ 2002-09-01  8:52 UTC
To: Gerd Stolpmann; +Cc: caml-list

Gerd Stolpmann wrote:

> previous versions of PXP, the internal representation of the XML trees was
> restricted to either UTF-8 or ISO-8859-1. Now, a number of additional
> encodings are supported, including the whole ISO-8859 series.

I have ALL the code sets specified at Unicode.org in
programmatic form. Easy to generate Ocaml versions
of the tables.

However, how about developing a standard I18n library
with an eye to future inclusion in the standard distribution?

The main question is: what form should the encode/decode
functions take?

My functions are in Python, and take the form:

    decode: string -> (int * string)
    encode: int -> string

where string is an 8 bit byte stream,
and int is a Unicode (or other) code point.
decode consumes one character from the front of the
stream and returns its code point together with the
rest of the stream.

The actual Python functions use dynamically loaded
data tables, but each character set has a fixed
format for the tables that knows about the raw
structure of the character set (e.g. what ranges of
hi and low bytes are allowed in two byte encodings
of Shift-JIS, KSC, etc). For Ocaml, we'd probably
want to bind the encodings at compile time
(since there is no well defined way to find
the data tables at run time :(

The tables are very compact, but there are quite
a few encodings -- some overhead if they're all
in the one module ..

--
John Max Skaller, mailto:skaller@ozemail.com.au
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850
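
[Editorial sketch] The decode/encode interface described in the message
above might look roughly like this in Python. This is only an illustration
under stated assumptions: the three-entry table is a hand-picked slice of
ISO-8859-7, and the names DECODE_TABLE, ENCODE_TABLE and decode_all are
hypothetical, not taken from the author's code, which is not shown in the
thread.

```python
from typing import List, Tuple

# Assumption: a tiny slice of ISO-8859-7 (Greek capital Alpha, Beta, Gamma).
# A real table would cover the whole character set.
DECODE_TABLE = {0xC1: 0x0391, 0xC2: 0x0392, 0xC3: 0x0393}
ENCODE_TABLE = {cp: byte for byte, cp in DECODE_TABLE.items()}


def decode(data: bytes) -> Tuple[int, bytes]:
    """Consume one character; return (code point, remaining bytes)."""
    if not data:
        raise ValueError("cannot decode from an empty byte stream")
    byte = data[0]
    if byte < 0x80:  # the ASCII range maps to itself
        return byte, data[1:]
    if byte not in DECODE_TABLE:
        raise ValueError("byte 0x%02X is not in the table" % byte)
    return DECODE_TABLE[byte], data[1:]


def encode(code_point: int) -> bytes:
    """Map one code point back to its byte sequence."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point not in ENCODE_TABLE:
        raise ValueError("U+%04X is not in the table" % code_point)
    return bytes([ENCODE_TABLE[code_point]])


def decode_all(data: bytes) -> List[int]:
    """Decode a whole stream by calling decode until nothing is left."""
    code_points = []
    while data:
        code_point, data = decode(data)
        code_points.append(code_point)
    return code_points
```

Because decode returns the unconsumed remainder, a caller can walk a stream
without knowing how many bytes each character occupies; the same shape would
carry over to two byte encodings, where decode would consume two bytes for
some lead bytes.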

* Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
From: Yamagata Yoriyuki @ 2002-09-01 11:57 UTC
To: caml-list

From: John Max Skaller <skaller@ozemail.com.au>
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Date: Sun, 01 Sep 2002 18:52:20 +1000

> I have ALL the code sets specified at Unicode.org in
> programmatic form. Easy to generate Ocaml versions
> of the tables.

Data at Unicode.org for East Asian encodings are buggy. Don't use
them. (Moreover, the Unicode Consortium has declared that it does not
intend to fix these bugs, and has made the East Asian mapping tables
obsolete. See
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/README.TXT)
I use mapping tables from glibc for my Camomile, which seem better
debugged.

> My functions are in Python, and take the form:
>
>     decode: string -> (int * string)
>     encode: int -> string
>
> where string is an 8 bit byte stream,
> and int is a unicode (or other) code point.

This interface has a problem with stateful encodings, which are quite
important here. (ISO-2022-JP, or JIS encoding, is stateful, and is the
standard encoding for email.) The meaning of a byte depends on the
escape sequences seen earlier in the stream, and the remaining string
returned by decode does not carry that state. In addition, it is
inefficient.

> The actual python functions use dynamically loaded
> data tables, but each character set has a fixed
> format for the tables that knows about the raw
> structure of the character set (eg what ranges of
> hi and low bytes are allowed in two byte encodings
> of Shift-Jis, KSC, etc). For Ocaml, we'd probably
> want to bind the encodings at compile time
> (since there is no well defined way to find
> the data tables at run time :(
>
> The tables are very compact, but there are quite
> a few encodings -- some overhead if they're all
> in the one module ..

I read somewhere that Perl6 delegates code conversion to add-on
programs, since making standard mapping tables is really hard.
(Even naming of encodings is a problem. There is no cross-platform
way of doing this.) Introducing a generic channel type (for char and
Unicode characters) and letting 3rd party libraries do the conversion
is a better solution, IMO.

--
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF
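
[Editorial sketch] The objection about stateful encodings raised in the
message above can be reproduced with the iso2022_jp codec from Python's
standard library. The helper name decode_chunks is hypothetical; the only
library call used is codecs.getincrementaldecoder. The point of the sketch
is that the same two bytes decode differently depending on which escape
sequence came before them, so a decoder has to carry state between calls.

```python
import codecs
from typing import Iterable


def decode_chunks(chunks: Iterable[bytes], encoding: str = "iso2022_jp") -> str:
    """Decode a stream delivered in pieces, keeping shift state between pieces."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any pending bytes
    return "".join(pieces)


ESC_JIS = b"\x1b$B"    # escape sequence: switch to JIS X 0208 (two byte mode)
ESC_ASCII = b"\x1b(B"  # escape sequence: switch back to ASCII

same_bytes = b'$"'     # 0x24 0x22

# In two byte mode the pair is one hiragana character (U+3042);
# after switching back to ASCII the same pair is a dollar sign and a quote.
text = decode_chunks([ESC_JIS, same_bytes, ESC_ASCII, same_bytes])
```

A function of the form string -> (int * string) sees only the remaining
bytes, so it cannot tell these two cases apart once the escape sequence has
been consumed. One common way around this, assumed here rather than taken
from the thread, is to thread an explicit state value through each call, as
the incremental decoder object does internally.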

* Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
From: John Max Skaller @ 2002-09-01 13:54 UTC
To: Yamagata Yoriyuki; +Cc: caml-list

Yamagata Yoriyuki wrote:

> Data at Unicode.org for East Asian encodings are buggy. Don't use
> them.

Noted.

>>My functions are in Python, and take the form:
>>
>>    decode: string -> (int * string)
>>    encode: int -> string
>>
>>where string is an 8 bit byte stream,
>>and int is a unicode (or other) code point.
>>
>
> This interface has a problem with stateful encodings, which are quite
> important here. (ISO-2022-JP or JIS encoding is stateful, and
> standard encoding for email.) In addition, it is inefficient.

Agree on both counts, though none of the encodings
I handle are stateful (I handle Shift-JIS, which isn't
stateful AFAIK).

The functions I give are canonical, and they're fast enough
in Python (if you want fast, you'd use C anyhow).

There is an issue for Ocaml: what is a Unicode string like?
My answer would be 'array of int'. But another answer is
'string with UTF-8 encoding'.

In theory, mappings and codecs are orthogonal.
UTF-8 has nothing to do with Unicode: it works just fine
for any national character set.

In practice, many character sets are defined by two byte encodings.
So you might want a function:

    Shift-JIS -> Unicode as UTF-8

modelled by

    string -> string (8 bit clean strings)

That can be made from the canonical functions, but it isn't
efficient to do the conversion via an integer intermediate
form.

> I read somewhere that Perl6 delegates code conversion to add-on
> programs, since making standard mapping tables is really hard.
> (Even naming of encodings is a problem. There is no cross-platform
> way of this.) Introducing generic channel type (for char and unicode
> character) and letting 3rd party libraries do conversion is better
> solution, IMO.

Well, you also want in-core conversions.
And then a third party library is an arbitrary function.

The problem is that people are rewriting these functions
for each application that needs some i18n support.
Reuse would be better, but that requires some form of
standardisation.

It's hard both to get the conversions right and to make
them efficient. I spent ages converting the unicode.org data
(I also found a bug in the UNICODE tables).

The problem is: 'third party libraries' might be a reasonable
answer for a C program. It's not so reasonable for Ocaml:
where are they? We're short of useful libraries .. indeed,
of a mechanism to install and access them.

--
John Max Skaller, mailto:skaller@ozemail.com.au
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850
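
[Editorial sketch] The Shift-JIS to UTF-8 conversion of type
string -> string discussed in the message above can be assembled from a
canonical decode and an integer-to-bytes encoder, going through an integer
for every character. The sketch below is an assumption-laden illustration:
the names sjis_decode, utf8_encode and transcode are hypothetical, and
Python's built-in shift_jis codec stands in for the mapping tables, which
the thread does not show.

```python
from typing import Callable, Tuple


def sjis_decode(data: bytes) -> Tuple[int, bytes]:
    """Canonical form: decode one Shift-JIS character from non-empty input."""
    lead = data[0]
    # Lead bytes 0x81-0x9F and 0xE0-0xEF start a two byte JIS X 0208 character;
    # ASCII and half-width katakana (0xA1-0xDF) occupy a single byte.
    width = 2 if (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEF) else 1
    char = data[:width].decode("shift_jis")  # stdlib codec used as the table
    return ord(char), data[width:]


def utf8_encode(value: int) -> bytes:
    """Encode an integer as UTF-8; the scheme does not care where it came from."""
    if value < 0x80:
        return bytes([value])
    if value < 0x800:
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    if value < 0x10000:
        return bytes([0xE0 | (value >> 12),
                      0x80 | ((value >> 6) & 0x3F),
                      0x80 | (value & 0x3F)])
    return bytes([0xF0 | (value >> 18),
                  0x80 | ((value >> 12) & 0x3F),
                  0x80 | ((value >> 6) & 0x3F),
                  0x80 | (value & 0x3F)])


def transcode(data: bytes,
              decode: Callable[[bytes], Tuple[int, bytes]],
              encode: Callable[[int], bytes]) -> bytes:
    """Build a bytes -> bytes converter from the two canonical functions."""
    out = bytearray()
    while data:
        value, data = decode(data)  # one integer per character
        out += encode(value)
    return bytes(out)


sample = b"\x82\xa0A"  # hiragana A (two bytes) followed by ASCII 'A'
converted = transcode(sample, sjis_decode, utf8_encode)
```

The per-character round trip through an integer, plus the repeated slicing
of the remaining bytes, is the overhead the message refers to; a dedicated
converter could instead map lead and trail bytes straight to UTF-8 byte
sequences in one pass.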