Mailing list for all users of the OCaml language and system.
From: pjfrey@sympatico.ca
To: caml-list@inria.fr
Subject: [Caml-list] Rope.of_string looses characters
Date: Thu, 21 Jul 2011 02:13:22 +0200
Message-ID: <sympa.1311206914.27933.773@inria.fr>

I am encountering a very baffling problem with Batteries:
BatRope.of_string loses characters when converting strings longer than
about 250 characters. This is so hard to imagine that I have written a
small demo program:

  open Printf

  let () =
    let filename = try Sys.argv.(1) with _ -> "chinese.txt" in
    let size = int_of_string (try Sys.argv.(2) with _ -> "0") in
    (* read some file into a string; this works fine *)
    let fd_in = BatFile.open_in filename in
    let fs = BatFile.size_of filename in
    let file_size = if size = 0 then fs else min size fs in
    let filestring = String.create file_size in
    ignore (BatIO.really_input fd_in filestring 0 file_size);
    BatIO.close_in fd_in;
    (* convert string to rope and back; rope MISSES CHARACTERS *)
    let rope = BatRope.of_string filestring in
    let rLen = BatRope.length rope in
    let reconverted = BatRope.to_string rope in
    let sLen = String.length reconverted in
    printf "len text:%i BatRope.of_string len:%i BatRope.to_string len:%i\n"
      file_size rLen sLen;
    printf "Number of missing characters after BatRope.to_string:%i\n"
      (file_size - sLen);
    (* walk both strings in step; on a mismatch, count one dropped character *)
    let x = filestring and size' = sLen and x' = reconverted in
    let rec show ix dropped lastError =
      if ix - dropped = size' then ()
      else if x.[ix] <> x'.[ix - dropped] then begin
        (* index in original, drops so far, gap since previous drop *)
        printf "%i,%i\t%3i\n" ix dropped (ix - lastError);
        show (succ ix) (succ dropped) ix
      end else
        (* printf "%c" x.[ix]; *)
        show (succ ix) dropped lastError
    in
    show 0 0 0

./a.out ISO-8859-1 10000
len text:5000 BatRope.of_string len:4981 BatRope.to_string len:4981
Number of missing characters after BatRope.to_string:19
256,0	256
513,1	257
770,2	257
1030,3	260
1284,4	254
1548,5	264
1802,6	254
2057,7	255
2312,8	255
2569,9	257
2826,10 257
3083,11 257
3340,12 257
3597,13 257
3860,14 263
4111,15 251
4368,16 257
4625,17 257
4882,18 257

The list above shows the index of each dropped character, the running
count of dropped characters, and the distance in index from the previous
miss.
With Chinese text it loses entire characters, so the list looks like:
738,0	738
739,1	  1
740,2	  1
1400,3	660
1401,4	  1
1402,5	  1
2024,6	622
2653,7	629
3362,8	709
3363,9	  1
3364,10   1
... thus it loses complete 3-byte characters, as expected.
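
For anyone who wants to reproduce the three columns (index, drops so far,
gap since the previous drop) without Batteries, here is a minimal sketch
using only the standard library. The name dropped_positions is made up for
illustration; it assumes the shorter string is the original with single
characters deleted, and unlike the demo it also reports characters dropped
at the very end. If a dropped character equals its successor, the reported
index is one too high, since the two cases cannot be told apart.

```ocaml
(* Hypothetical helper, standard library only: report where [short]
   diverges from [full], assuming [short] is [full] with some single
   characters deleted. Returns (index in full, drops so far, gap since
   the previous drop) triples, mirroring the three printed columns. *)
let dropped_positions (full : string) (short : string)
  : (int * int * int) list =
  let n = String.length full and m = String.length short in
  let rec go ix dropped last acc =
    if ix >= n then List.rev acc
    else if ix - dropped < m && full.[ix] = short.[ix - dropped] then
      go (ix + 1) dropped last acc
    else
      (* Mismatch, or [short] is exhausted: count full.[ix] as dropped. *)
      go (ix + 1) (dropped + 1) ix ((ix, dropped, ix - last) :: acc)
  in
  go 0 0 0 []

let () =
  (* Tiny example: 'c' and 'f' have been removed from the original. *)
  dropped_positions "abcdefgh" "abdegh"
  |> List.iter (fun (ix, dropped, gap) ->
         Printf.printf "%d,%d\t%3d\n" ix dropped gap)
```

Passing filestring and reconverted from the demo to this function should
give the same rows as the listings above, as long as the assumption about
single-character deletions holds.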

It appears that BatRope.of_string splits the string into chunks that are
not correctly reassembled, but that would affect a lot of other code...

Please, somebody tell me I am imagining things and there is a silly
error in the above code.

Peter Frey

