From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <50F96C89.3030905@lexifi.com>
Date: Fri, 18 Jan 2013 16:38:49 +0100
From: Alain Frisch
Organization: LexiFi
To: Daniel Bünzli
CC: caml-list
References: <50F95B3B.4050607@lexifi.com> <21CE18DB888F4112A274E2BCBF1D0B6B@erratique.ch>
In-Reply-To: <21CE18DB888F4112A274E2BCBF1D0B6B@erratique.ch>
Subject: Re: [Caml-list] sedlex = ulex without camlp4

I have to admit that I don't know much about Unicode and surrogates (moreover, support for UTF-16 was contributed by someone else). I'll happily update the documentation if someone looks at the source code and tells me that the property you mention indeed holds.

-- Alain

On 01/18/2013 04:32 PM, Daniel Bünzli wrote:
> Hello Alain,
>
> I quickly went through your documentation.
>
> If your UTF-8 and UTF-16 decoders are conformant, your module, on output, doesn't generate Unicode code points, but Unicode scalar values (code points minus the UTF-16 surrogates [1]). If that is the case, it would be nice to state this invariant explicitly in the documentation.
>
> This allows the data generated by sedlex to be passed directly to modules like Uunf without further checks, as those values belong to the Uunf.uchar type [2].
>
> Best,
>
> Daniel
>
> [1] http://www.unicode.org/glossary/#unicode_scalar_value
> [2] http://erratique.ch/software/uunf/doc/Uunf#TYPEuchar
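
For reference, the invariant discussed in this thread (scalar values are code points minus the UTF-16 surrogates) could be checked with a small predicate. This is only a minimal sketch, assuming code points are represented as plain OCaml ints; the name `is_scalar_value` is hypothetical and is not part of sedlex or Uunf.

```ocaml
(* Hypothetical helper, not part of sedlex or Uunf.
   A Unicode scalar value is any code point in 0x0000..0x10FFFF
   except the surrogate range 0xD800..0xDFFF. *)
let is_scalar_value (cp : int) : bool =
  (0x0000 <= cp && cp <= 0xD7FF) || (0xE000 <= cp && cp <= 0x10FFFF)

let () =
  assert (is_scalar_value 0x0041);         (* 'A' *)
  assert (is_scalar_value 0x10FFFF);       (* last valid code point *)
  assert (not (is_scalar_value 0xD800));   (* first high surrogate *)
  assert (not (is_scalar_value 0xDFFF));   (* last low surrogate *)
  assert (not (is_scalar_value 0x110000))  (* beyond the Unicode range *)
```

If a decoder only ever returns values for which such a predicate holds, its output can be handed to code that expects scalar values without a further check.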