From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p6L0DSsb032288 for ; Thu, 21 Jul 2011 02:13:28 +0200 X-IronPort-AV: E=Sophos;i="4.67,238,1309730400"; d="scan'208";a="113580121" Received: from walapai.inria.fr ([128.93.30.24]) by mail1-relais-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 21 Jul 2011 02:13:23 +0200 Received: from walapai.inria.fr (localhost [127.0.0.1]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p6L0DNXL032285 for ; Thu, 21 Jul 2011 02:13:23 +0200 Received: (from sympa@localhost) by walapai.inria.fr (8.13.6/8.12.10/Submit) id p6L0DMxc032284 for caml-list@inria.fr; Thu, 21 Jul 2011 02:13:22 +0200 Date: Thu, 21 Jul 2011 02:13:22 +0200 X-Authentication-Warning: walapai.inria.fr: sympa set sender to sympa@inria.fr using -f To: caml-list@inria.fr From: pjfrey@sympatico.ca In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Subject: [Caml-list] Rope.of_string looses characters I am encountering a very baffling problem with Batteries: BatRope.of_string looses characters when converting strings longer then about 250 characters. This is so hard to imagine that I have written a small demo program: let filename = (try ((Sys.argv.(1))) with _ -> "chinese.txt") in let size = int_of_string (try ((Sys.argv.(2))) with _ -> "0") in (* read some file into a string; this works fine *) let fd_in = BatFile.open_in filename in let fs = BatFile.size_of filename in let file_size = if size = 0 then fs else (min size fs) in let filestring = String.create file_size in let _ = ignore (BatIO.really_input fd_in filestring 0 file_size) in BatIO.close_in fd_in; (* convert string to rope and back; rope MISSES CHARACTERS *) let rope = BatRope.of_string filestring in let rLen = BatRope.length rope in let reconverted = BatRope.to_string rope in let sLen = String.length reconverted in printf "len text:%i BatRope.of_string len:%i BatRope.to_string len:%i in" file_size rLen sLen; printf"Number of missing characters after BatRope.to_string:%i in" (file_size-sLen); let x = filestring in let size' = sLen in let x' = reconverted in let rec show ix dropped lastError = if (ix - dropped ) = size' then () else begin if x.[ix] <> x'.[ix - dropped] then begin printf"%i,%i\t%3i\n" ix dropped (ix - lastError); show (succ ix) (succ dropped) ix end else (* printf"%c" x.[ix]; *) show (succ ix) dropped lastError end in show 0 0 0 ./a.out ISO-8859-1 10000 len text:5000 BatRope.of_string len:4981 BatRope.to_string len:4981 Number of missing characters after BatRope.to_string:19 256,0 256 513,1 257 770,2 257 1030,3 260 1284,4 254 1548,5 264 1802,6 254 2057,7 255 2312,8 255 2569,9 257 2826,10 257 3083,11 257 3340,12 257 3597,13 257 3860,14 263 4111,15 251 4368,16 257 4625,17 257 4882,18 257 The list above shows the indexes of the dropped characters, the count of dropped characters and the difference in the index between occurences of misses. With chinese text it looses entire characters so the list above looks like: 738,0 738 739,1 1 740,2 1 1400,3 660 1401,4 1 1402,5 1 2024,6 622 2653,7 629 3362,8 709 3363,9 1 3364,10 1 ... thus it looses complete 3-byte characters, as expected. It appears that BatRope.of_string splits the file into chunks, that are not correctly re_assembled, but that would affect a lot of other code... Please, somebody tell me I am imagining things and there is a silly error in above code. Peter Frey