* [Caml-list] ocaml-pcre and UTF-8
@ 2012-02-16 9:29 Philippe Strauss
2012-02-16 10:18 ` Mauricio Fernandez
2012-02-16 10:18 ` Philippe Strauss
0 siblings, 2 replies; 3+ messages in thread
From: Philippe Strauss @ 2012-02-16 9:29 UTC
To: caml-list
Hello caml'ers,
How do I convince PCRE to be UTF-8 friendly? Example:
--
open Pcre
external show : 'a -> string = "%show"
let recomp = regexp ~flags:[`UTF8; `CASELESS]
let res_w = "(*UTF8)^(\w+)$"
let rec_w = recomp res_w
let accents = ["blurb"; "toxicité"; "velléités"; "à"; "où"; "über"; "marie-jeanne"]
let () =
  Printf.printf "config_utf8=%b\n" Pcre.config_utf8 ;
  List.iter (fun word ->
    try
      let sub = Pcre.extract_opt ~full_match:false ~rex:rec_w word in
      print_endline (show sub)
    with Not_found -> Printf.eprintf "Not_found was raised on \"%s\" :-(\n%!" word
  ) accents
--
At least on my setup, it gives:
--
philou@air:~/mysrc/web/myco$ ./pcre_utf
config_utf8=true
[|Some ("blurb")|]
Not_found was raised on "toxicité" :-(
Not_found was raised on "velléités" :-(
Not_found was raised on "à" :-(
Not_found was raised on "où" :-(
Not_found was raised on "über" :-(
Not_found was raised on "marie-jeanne" :-(
--
A PHP user, who happens to be a nice guy, got it working in PHP :-( How lame.
--
Philippe Strauss
http://www.strauss-acoustics.ch/
* Re: [Caml-list] ocaml-pcre and UTF-8
From: Mauricio Fernandez @ 2012-02-16 10:18 UTC
To: caml-list
On Thu, Feb 16, 2012 at 10:29:30AM +0100, Philippe Strauss wrote:
> Hello caml'ers,
>
> How do I convince PCRE to be UTF-8 friendly? example:
>
> --
> open Pcre
>
> external show : 'a -> string = "%show"
As an aside: where did you get this external from? I had to write a proper
show function on 3.12.0 in order to compile your example.
> let recomp = regexp ~flags:[`UTF8; `CASELESS]
>
> let res_w = "(*UTF8)^(\w+)$"
The \w there would have to be written \\w in an OCaml string literal, if anything, but the pcre manual warns that
    6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but, by default, the characters
       that PCRE recognizes as digits, spaces, or word characters remain the
       same set as before, all with values less than 256. This remains
       true even when PCRE is built to include Unicode property support,
       because to do otherwise would slow down PCRE in many common cases. Note
       in particular that this applies to \b and \B, because they are defined
       in terms of \w and \W. If you really want to test for a wider sense of,
       say, "digit", you can use explicit Unicode property tests such as
       \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
       character escapes work is changed so that Unicode properties are used
       to determine which characters match. There are more details in the
       section on generic character types in the pcrepattern documentation.
and pcrepattern lists this Unicode property:
    Xwd   Any Perl "word" character
so, given a suitable show function, both
    let res_w = "^(\\p{Xwd}+)$"
and
    let res_w = "(*UCP)^(\\w+)$"
yield
    ./pcre_utf
    config_utf8=true
    [|Some blurb|]
    [|Some toxicité|]
    [|Some velléités|]
    [|Some à|]
    [|Some où|]
    [|Some über|]
    Not_found was raised on "marie-jeanne" :-(
('-' is not a "word" character: Xwd, and \w under UCP, cover only letters, digits, and underscore, regardless of locale)
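For anyone else whose compiler has no "%show" primitive, a stand-in could look like the sketch below. It is only an illustration: `show_opt` and this `show` are hypothetical helpers, not part of ocaml-pcre or the standard library, and they are specialised to the `string option array` that `Pcre.extract_opt` returns. The output format is chosen to mimic the lines printed above.

```ocaml
(* Hypothetical replacement for the "%show" external used in the
   original example. It only handles string option arrays, which is
   the type Pcre.extract_opt returns. *)
let show_opt = function
  | Some s -> "Some " ^ s
  | None -> "None"

let show (subs : string option array) : string =
  let items = Array.to_list (Array.map show_opt subs) in
  "[|" ^ String.concat "; " items ^ "|]"

(* In an OCaml string literal the backslash must itself be escaped,
   so the two-character regex escape \w is written "\\w". *)
let () = assert (String.length "\\w" = 2)

let () = print_endline (show [| Some "blurb" |])
```

With this definition in place of the `external` line, the rest of the example needs no change, since `show sub` is applied to exactly that array type.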
--
Mauricio Fernandez - http://eigenclass.org
* Re: [Caml-list] ocaml-pcre and UTF-8
From: Philippe Strauss @ 2012-02-16 10:18 UTC
To: caml-list
Oh, I found something in PLEAC and in man pcrepattern:
let res_w = "^([\\p{Latin}\\-]+)$"
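To compare the patterns mentioned in this thread side by side, something like the following sketch could be used. It assumes the ocaml-pcre library is installed (the findlib package name `pcre` in the comment is an assumption about the local setup) and that the underlying PCRE was built with Unicode property support; without that, compiling `(*UCP)` or `\p{...}` is expected to raise an error. `compile` and `matches` are hypothetical helper names introduced here.

```ocaml
(* Sketch only. Build with something like:
     ocamlfind ocamlopt -package pcre -linkpkg pcre_utf.ml -o pcre_utf
   (package name assumed). *)

let words =
  ["blurb"; "toxicité"; "velléités"; "à"; "où"; "über"; "marie-jeanne"]

(* Patterns taken from the thread; backslashes are doubled because
   these are OCaml string literals. *)
let patterns = [
  "(*UCP)^(\\w+)$";         (* \w switched to Unicode properties *)
  "^(\\p{Xwd}+)$";          (* explicit Perl "word" property *)
  "^([\\p{Latin}\\-]+)$";   (* Latin script plus a literal hyphen *)
]

(* Same flags as the original example. *)
let compile pat = Pcre.regexp ~flags:[`UTF8; `CASELESS] pat

let matches rex word = Pcre.pmatch ~rex word

let () =
  List.iter (fun pat ->
    let rex = compile pat in
    List.iter (fun word ->
      Printf.printf "%-22s %-14s %b\n" pat word (matches rex word))
      words)
    patterns
```

Using `Pcre.pmatch` rather than `Pcre.extract_opt` avoids the `Not_found` handling, since only a yes/no answer per word is needed here.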
--
Philippe Strauss
http://www.strauss-acoustics.ch/