From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.4 required=5.0 tests=SPF_NEUTRAL autolearn=disabled version=3.1.3 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 21EE7BC69 for ; Mon, 15 Oct 2007 22:34:26 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAFtrE0dA6ba/kmdsb2JhbACOSAEBAQEHAgYJChaBKQ X-IronPort-AV: E=Sophos;i="4.21,278,1188770400"; d="scan'208";a="18043697" Received: from nf-out-0910.google.com ([64.233.182.191]) by mail4-smtp-sop.national.inria.fr with ESMTP; 15 Oct 2007 22:34:25 +0200 Received: by nf-out-0910.google.com with SMTP id e27so1277369nfd for ; Mon, 15 Oct 2007 13:34:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:received:date:to:subject:message-id:references:mime-version:content-type:content-disposition:in-reply-to:user-agent:from; bh=CkKfLudq2iFmoNLijkzJCIje7/Uaeh65LwO97hW0BeU=; b=PotBNmGPZXRj8EoNYOhQ9qqqbBG2IAt9wfUu52I+u214kEpa7lDyzEbZMBht3OnCwuG84/aU4ubnZ4VIAcpBxWCJ/ZEu3R+i5ll6VmFFLdI/qETh1b9gKV8BZaOZdnz+9lu48hqsjwBZS9AnqUXubpzrlb8b6fXquBFeEVeoxVg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:date:to:subject:message-id:references:mime-version:content-type:content-disposition:in-reply-to:user-agent:from; b=kBxFMdFbz1Rwx4N5Tm02V+5eJ54TkJ3/I5jDRFFtMBpoE/lpD2eRM9FCiZ6Fy8cympzkh3W/abOWTiAk4sZ4Mp/wzeH5zYOo+dibB1ZFU9cDLYI74J6eG1RrZvg46kr39YnYHxFM+ir4UcglSKtoY5s06gXc6HF5c/sYRXPzhd8= Received: by 10.78.142.14 with SMTP id p14mr4309332hud.1192480463787; Mon, 15 Oct 2007 13:34:23 -0700 (PDT) Received: from localhost ( [83.201.28.205]) by mx.google.com with ESMTPS id c9sm5565874nfi.2007.10.15.13.34.14 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 15 Oct 2007 13:34:22 -0700 (PDT) Received: by localhost (sSMTP sendmail emulation); Mon, 15 Oct 2007 22:35:10 +0200 Date: Mon, 15 Oct 2007 22:35:10 +0200 To: caml-list@yquem.inria.fr Subject: Warning on home-made functions dealing with UTF-8. Message-ID: <20071015203509.GA5212@localhost> References: <20071009155702.GB26282@snarc.org> <6f9f8f4a0710090942u2afe6c5erc2d5b11ecfff2253@mail.gmail.com> <20071009165520.GA27507@snarc.org> <6f9f8f4a0710091032r3dace80bi4f9b584ae5056675@mail.gmail.com> <20071009195119.GA29263@snarc.org> <875c7e070710091504j203ff78y8703fc5c22086671@mail.gmail.com> <20071011130354.GA7016@snarc.org> <1192110864.6045.12.camel@rosella.wigram> <20071011142141.GA8001@snarc.org> <1192114097.6184.7.camel@rosella.wigram> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <1192114097.6184.7.camel@rosella.wigram> User-Agent: Mutt/1.5.16 (2007-06-11) From: Julien Moutinho X-Spam: no; 0.00; home-made:01 0200,:01 camomile:01 camomile:01 val:01 val:01 byte:01 wikipedia:01 wiki:01 char:01 char:01 wrote:01 wrote:01 functions:01 functions:01 On Fri, Oct 12, 2007 at 12:48:16AM +1000, skaller wrote: > On Thu, 2007-10-11 at 16:21 +0200, Vincent Hanquez wrote: > > On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote: > > > You can't: Camomile is massive for a reason.. the problem it > > > aims to solve is complex and hard to do efficiently without > > > a large set of specialised functions. > > > > You are assuming that i want efficiency where i want to print few > > unicode string in an ui here and there. I *DON'T* want to be exposed to > > full unicode, i need something like 1/100 of camomile library. > > In that case, you can use an int Array.t for Unicode provided > it is only 31 bit OR you have a 64 bit machine. These routines > should help converting to and from UTF-8: > [...] Just in case someone would want to use this parse_utf8, be aware that depending on the trust you have in your input, it may be sorely discouraged to do so. Indeed, this code does not check comprehensively for invalid characters. eg. for characters with an overlong form [1]: # let mk = List.fold_left (fun acc c -> acc ^ String.make 1 (Char.chr c)) "";; val mk : int list -> string = # let p l = parse_utf8 (mk l) 0;; val p : int list -> int * int = (* unicode 0 coded into an overlong utf-8 form *) # p [0b11_000000; 0b10_000000];; - : int * int = (0, 2) Nor does it checks for invalid trailing bytes : (* unicode 64 (@) with and invalid trailing byte, * which happens to be a zero *) # p [0b11_000001; 0b00_00000];; - : int * int = (64, 2) Besides "now" an unicode value needs only 21 bits and "therefore" an utf-8 char holds into at most 4 bytes, not 6 as the code handles. [1] http://en.wikipedia.org/wiki/UTF-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations