From: pjfrey@sympatico.ca
To: caml-list@inria.fr
Subject: [Caml-list] Rope.of_string looses characters
Date: Thu, 21 Jul 2011 02:13:22 +0200
Message-ID: <sympa.1311206914.27933.773@inria.fr>
I am encountering a very baffling problem with Batteries:
BatRope.of_string loses characters when converting strings longer than
about 250 characters. This is so hard to imagine that I have written a
small demo program:
let filename = (try ((Sys.argv.(1))) with _ -> "chinese.txt") in
let size = int_of_string (try ((Sys.argv.(2))) with _ -> "0") in
(* read some file into a string; this works fine *)
let fd_in = BatFile.open_in filename in
let fs = BatFile.size_of filename in
let file_size = if size = 0 then fs else (min size fs) in
let filestring = String.create file_size in
let _ = ignore (BatIO.really_input fd_in filestring 0 file_size) in
BatIO.close_in fd_in;
(* convert string to rope and back; rope MISSES CHARACTERS *)
let rope = BatRope.of_string filestring in
let rLen = BatRope.length rope in
let reconverted = BatRope.to_string rope in
let sLen = String.length reconverted in
Printf.printf
  "len text:%i BatRope.of_string len:%i BatRope.to_string len:%i\n"
  file_size rLen sLen;
Printf.printf "Number of missing characters after BatRope.to_string:%i\n"
  (file_size - sLen);
let x = filestring in
let size' = sLen in
let x' = reconverted in
let rec show ix dropped lastError =
  if ix - dropped = size' then ()
  else begin
    if x.[ix] <> x'.[ix - dropped] then begin
      Printf.printf "%i,%i\t%3i\n" ix dropped (ix - lastError);
      show (succ ix) (succ dropped) ix
    end else
      (* printf "%c" x.[ix]; *)
      show (succ ix) dropped lastError
  end
in
show 0 0 0
./a.out ISO-8859-1 10000
len text:5000 BatRope.of_string len:4981 BatRope.to_string len:4981
Number of missing characters after BatRope.to_string:19
256,0 256
513,1 257
770,2 257
1030,3 260
1284,4 254
1548,5 264
1802,6 254
2057,7 255
2312,8 255
2569,9 257
2826,10 257
3083,11 257
3340,12 257
3597,13 257
3860,14 263
4111,15 251
4368,16 257
4625,17 257
4882,18 257
The list above shows the index of each dropped character, the count of
characters dropped so far, and the distance in index from the previous
dropped character.
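
For reference, the same comparison can be written as a pure function that
returns the list instead of printing it. This is only a sketch using the
standard library; the name dropped_positions is made up for illustration,
and it assumes the shorter string is the original with some bytes deleted.
Unlike the demo program, it walks to the end of the original, so bytes
missing from the tail are reported too. Like the demo, it matches greedily,
so when a dropped byte equals its successor the reported index can be off
by one.

```ocaml
(* Each entry is (index in original, bytes dropped before it,
   distance from the previous dropped index). *)
let dropped_positions original shorter =
  let n = String.length original and m = String.length shorter in
  let rec go ix dropped last acc =
    if ix >= n then List.rev acc
    else if ix - dropped < m && original.[ix] = shorter.[ix - dropped] then
      (* bytes agree: advance in both strings *)
      go (ix + 1) dropped last acc
    else
      (* byte at ix has no counterpart: record it as dropped *)
      go (ix + 1) (dropped + 1) ix ((ix, dropped, ix - last) :: acc)
  in
  go 0 0 0 []
```

Called as dropped_positions filestring reconverted, it would yield triples
in the same order as the columns printed above.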
With Chinese text it loses entire characters, so the list looks like:
738,0 738
739,1 1
740,2 1
1400,3 660
1401,4 1
1402,5 1
2024,6 622
2653,7 629
3362,8 709
3363,9 1
3364,10 1
... thus it loses complete 3-byte characters, as expected.
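
The runs of three consecutive indexes fit UTF-8, where most Chinese
characters take three bytes (this assumes the Chinese file is UTF-8
encoded, which the message does not state). A minimal standard-library
sketch for telling bytes from characters follows; utf8_seq_len and
utf8_char_count are illustrative names, not Batteries functions, and
neither validates the input.

```ocaml
(* Length of the UTF-8 sequence announced by a lead byte, or None for a
   continuation byte or a byte that cannot start a sequence.
   No check for overlong or otherwise invalid encodings. *)
let utf8_seq_len c =
  let b = Char.code c in
  if b < 0x80 then Some 1
  else if b land 0xE0 = 0xC0 then Some 2
  else if b land 0xF0 = 0xE0 then Some 3
  else if b land 0xF8 = 0xF0 then Some 4
  else None

(* Number of bytes that are not continuation bytes (10xxxxxx); for valid
   UTF-8 this is the number of characters. *)
let utf8_char_count s =
  let count = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr count) s;
  !count
```

Comparing utf8_char_count filestring with String.length filestring shows
how far the byte count and the character count of the input differ.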
It appears that BatRope.of_string splits the string into chunks that are
not correctly reassembled, but that would affect a lot of other code...
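
One way to pin down the "about 250 characters" threshold would be to try
every length in turn and report the first one whose round trip changes the
string. The sketch below uses only the standard library;
first_failing_length and lossy are hypothetical names, and lossy is merely
a stand-in that drops one byte so the search has something to find. The
printable-ASCII test data is also an assumption: non-ASCII bytes may be
needed to trigger the behaviour described here.

```ocaml
(* Smallest length in [1, max_len] for which [roundtrip] does not return
   its input unchanged, or None if every length passes. *)
let first_failing_length ~roundtrip max_len =
  let rec go n =
    if n > max_len then None
    else
      (* test data: printable ASCII, cycling through codes 32..126 *)
      let s = String.init n (fun i -> Char.chr (32 + i mod 95)) in
      if roundtrip s = s then go (n + 1) else Some n
  in
  go 1

(* Stand-in for a faulty conversion: drops the byte at index 256. *)
let lossy s =
  if String.length s <= 256 then s
  else String.sub s 0 256 ^ String.sub s 257 (String.length s - 257)
```

With Batteries available, the round trip from the demo program could be
passed as
~roundtrip:(fun s -> BatRope.to_string (BatRope.of_string s)).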
Please, somebody tell me I am imagining things and there is a silly
error in the above code.
Peter Frey