From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q1GAIH9K022620 for ; Thu, 16 Feb 2012 11:18:18 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AucAAIHXPE/VBIoRi2dsb2JhbABDsQwBAQEKCwsHEgYjgXIBAQQBJwsBSwsLGC4UKIgzA7Vvi1QIJhIDAgULBAUFAQIBCAQECw0DAoNtAi0aHoI6YwSVNQGSaQ X-IronPort-AV: E=Sophos;i="4.73,428,1325458800"; d="scan'208";a="131611816" Received: from impaqm1.telefonica.net (HELO telefonica.net) ([213.4.138.17]) by mail4-smtp-sop.national.inria.fr with ESMTP; 16 Feb 2012 11:18:16 +0100 Received: from IMPmailhost2.adm.correo ([10.20.102.39]) by IMPaqm1.telefonica.net with bizsmtp id aXuj1i0210r0BT63MaJFK1; Thu, 16 Feb 2012 11:18:15 +0100 Received: from NANA.localdomain ([83.50.99.94]) by IMPmailhost2.adm.correo with BIZ IMP id aaJE1i00d22BHaB1iaJEzz; Thu, 16 Feb 2012 11:18:15 +0100 X-Brightmail-Tracker: AAAAAA== X-original-sender: ferferse@telefonica.net Received: from mfp by NANA.localdomain with local (Exim 4.72) (envelope-from ) id 1RxyPm-0002Fx-2T for caml-list@inria.fr; Thu, 16 Feb 2012 11:18:14 +0100 Date: Thu, 16 Feb 2012 11:18:13 +0100 From: Mauricio Fernandez To: caml-list@inria.fr Message-ID: <20120216101813.GA11489@NANA.localdomain> Mail-Followup-To: caml-list@inria.fr References: <32FE3555-556A-43AC-8B1B-9A4AF08DBA02@strauss-acoustics.ch> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <32FE3555-556A-43AC-8B1B-9A4AF08DBA02@strauss-acoustics.ch> User-Agent: Mutt/1.5.17+20080114 (2008-01-14) Subject: Re: [Caml-list] ocaml-pcre and UTF-8 On Thu, Feb 16, 2012 at 10:29:30AM +0100, Philippe Strauss wrote: > Hello caml'ers, > > How do I convince PCRE to be UTF-8 friendly? example: > > -- > open Pcre > > external show : 'a -> string = "%show" As an aside: where did you get this external from? I had to write a proper show function on 3.12.0 in order to compile your example. > let recomp = regexp ~flags:[`UTF8; `CASELESS] > > let res_w = "(*UTF8)^(\w+)$" ===== It would be \\w if anything, but the pcre manual warns that 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test characters of any code value, but, by default, the characters that PCRE recognizes as digits, spaces, or word characters remain the same set as before, all with values less than 256. This remains true even when PCRE is built to include Unicode property support, because to do otherwise would slow down PCRE in many common cases. Note in particular that this applies to \b and \B, because they are defined in terms of \w and \W. If you really want to test for a wider sense of, say, "digit", you can use explicit Unicode property tests such as \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the character escapes work is changed so that Unicode properties are used to determine which characters match. There are more details in the section on generic character types in the pcrepattern documentation. and pcrepattern lists this Unicode property: Xwd Any Perl "word" character so, given a suitable show function, both let res_w = "^(\\p{Xwd}+)$" and let res_w = "(*UCP)^(\\w+)$" yield ./pcre_utf config_utf8=true [|Some blurb|] [|Some toxicité|] [|Some velléités|] [|Some à|] [|Some où|] [|Some über|] Not_found was raised on "marie-jeanne" :-( ('-' not being a "word character" in my locale) -- Mauricio Fernandez - http://eigenclass.org