* [Caml-list] Bug with really_input under cygwin
From: Eric Dahlman @ 2004-03-09 22:30 UTC
To: caml-list

Howdy all,

I have some code which reads a whole file in and returns it as a
string. To do this I am using a combination of in_channel_length and
really_input, which has worked just fine in a Unix environment but
breaks under Cygwin. This looks like a problem with line ending
translation: the length reported by in_channel_length counts each DOS
line ending as two characters, but really_input reads it in as one.
The end result is that the length is too long by the number of newlines
in the file, and the call to really_input fails.

Here is a function which demonstrates the problem:

    let measureUp () =
      let (name, channel) = Filename.open_temp_file "temp" ".foo" in
      List.iter (fun x -> output_string channel x)
        [ "This\n" ; "is\n" ; "a\n" ; "spiffy\n" ; "test\n" ];
      close_out channel;
      (* now read it back in *)
      let ins = open_in name in
      let length = in_channel_length ins in
      let result = String.create length in
      really_input ins result 0 length;
      close_in ins;
      Unix.unlink name;
      result

This function works fine under Unix but will fail under Cygwin. I have
tried to use set_binary_mode_* to see if that would help but it did not
alter the results. So that leaves me with a couple of questions:

How should I slurp a whole file into a string portably in OCaml?

Is this a bug or just an unfortunate result of running under Windows?
(At the very least it is a documentation bug.)

Going back to the original problem that started me down this road: is
there an analogue to string-streams in Common Lisp, which are special
output channels that write to a string in memory rather than a file on
disk?

Thanks a bunch!
-Eric

-------------------
To unsubscribe, mail caml-list-request@inria.fr
Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
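
On the string-stream question, the closest standard-library analogue is probably the Buffer module: a growable in-memory buffer that you append to and then read out as a string. A minimal sketch, assuming that is the behaviour wanted; the function name render_lines is made up for illustration:

```ocaml
(* Accumulate output in memory instead of writing to a channel.
   render_lines is a hypothetical helper, not something from the thread. *)
let render_lines (lines : string list) : string =
  let buf = Buffer.create 64 in
  (* Append each line followed by a newline, as output_string would. *)
  List.iter
    (fun l ->
      Buffer.add_string buf l;
      Buffer.add_char buf '\n')
    lines;
  (* Printf.bprintf formats directly into the buffer. *)
  Printf.bprintf buf "%d lines" (List.length lines);
  Buffer.contents buf
```

Buffer is not an out_channel, so code written against out_channel cannot be pointed at it unchanged; functions that take a Buffer.t (or a Format formatter built on one) are the usual workaround.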
* Re: [Caml-list] Bug with really_input under cygwin
From: Karl Zilles @ 2004-03-09 22:52 UTC
To: Eric Dahlman; +Cc: caml-list

Eric Dahlman wrote:
> This function works fine under Unix but will fail under Cygwin. I have
> tried to use set_binary_mode_* to see if that would help but it did not
> alter the results. So that leaves me with a couple of questions:

Use open_in_bin to avoid translation of line endings. Of course, your
string will then contain \r\n after each line, but this may be ok for
you.
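
A minimal sketch of the open_in_bin suggestion, assuming a regular file on disk and a recent OCaml (really_input_string and the `match ... with exception` form are newer than this thread); slurp_bin is a hypothetical name:

```ocaml
(* Read a whole regular file as raw bytes. In binary mode no newline
   translation happens, so the length reported by in_channel_length
   matches the number of bytes really_input_string will deliver. *)
let slurp_bin (path : string) : string =
  let ic = open_in_bin path in
  match really_input_string ic (in_channel_length ic) with
  | s ->
      close_in ic;
      s
  | exception e ->
      (* Do not leak the channel if reading fails. *)
      close_in_noerr ic;
      raise e
```

Any \r\n pairs in the file come back unchanged, so the caller decides what to do with them.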
* Re: [Caml-list] Bug with really_input under cygwin
From: skaller @ 2004-03-10 3:06 UTC
To: Eric Dahlman; +Cc: caml-list

On Wed, 2004-03-10 at 09:30, Eric Dahlman wrote:
> Howdy all,
>
> I have some code which reads a whole file in and returns it as a
> string.

The only correct way to do this is to read a block at a time
until you get a partial block.

This is so EVEN in 'binary' mode, which is just another
ill conceived Unix hack :-)

Generally speaking, every output method should specify
a retrieval method or two, and you will only get well
defined results if you use the specified retrieval method.

It is unfortunate that C and Unix do not provide a coherent
abstraction in this area. Even binary I/O is ill-conceived:
who says the bytes get written in order and read in the
same order? What if one channel is opened in 16 bit word
mode, and the other in 8 bit mode?

C has been plagued by extremely ill considered functions.
Even the basic IO operation is not correctly defined.
In particular the function putc(int) is an invalid
specification. What happens if int = char and you have
1's complement encoding?

The bottom line is: if you wrote the file yourself,
there should be no problem. Just use BASIC I/O operations.
Functions like 'in_channel_length' are not properly defined
in the Ocaml manual and therefore should not be used.

There is no such thing as 'the number of characters
in a file'. Perhaps there is a number of bytes in a file.
Perhaps, using some decoding technique, there is a well
defined number of Unicode/ISO-10646 code points.

In MS-DOS, files *always* consist of a number of 256
byte blocks. It is impossible to have a file with
a non-256 byte multiple size. Of course, text files
use an encoding with a Ctrl-Z at the end. So the length
of the file 'in bytes' is not the same as the length
of the file 'in Latin-1'. The number of lines in the
file isn't well defined: CR/LF marks end of line,
but what happens if the CR and LF are scattered randomly?

Under Linux, the Standard for text encoding is UTF-8.
So 'characters' <> bytes unless the text is in the ASCII
subset. Even that is not clear, since if you get a code
point 0 (NUL) some C functions will return a false result,
for example fgets().

I personally believe the easiest way to work around this
quagmire of malspecification is to

(a) ONLY use 8 bit binary I/O
(b) ALWAYS read and write bytes

even if you're processing text. Never depend on the
language or OS conversion functions, it's very unlikely
they'll be right. Do all the conversions needed yourself.
At least when you find a problem you're not handling
correctly you can fix it.

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net
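
One way to read "do all the conversions needed yourself" for the line-ending case: read the bytes untranslated, then normalise in your own code. A minimal sketch, assuming the only conversion wanted is CR LF to LF; normalize_newlines is a hypothetical name, and the policy for lone CR bytes (left alone here) is a choice, not something the thread specifies:

```ocaml
(* Replace every CR LF pair with a single LF. Lone CR or LF bytes are
   copied through unchanged. *)
let normalize_newlines (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    if s.[!i] = '\r' && !i + 1 < n && s.[!i + 1] = '\n' then begin
      (* Collapse the two-byte DOS line ending into one LF. *)
      Buffer.add_char buf '\n';
      i := !i + 2
    end else begin
      Buffer.add_char buf s.[!i];
      incr i
    end
  done;
  Buffer.contents buf
```

Because the conversion runs on bytes already in memory, it behaves the same on every platform, whatever the runtime's text mode does.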
* Re: [Caml-list] Bug with really_input under cygwin
From: David Brown @ 2004-03-10 4:10 UTC
To: skaller; +Cc: Eric Dahlman, caml-list

On Wed, Mar 10, 2004 at 02:06:59PM +1100, skaller wrote:
> In MS-DOS, files *always* consist of a number of 256
> byte blocks. It is impossible to have a file with
> a non-256 byte multiple size. Of course, text files
> use an encoding with a Ctrl-Z at the end. So the length
> of the file 'in bytes' is not the same as the length
> of the file 'in Latin-1'. The number of lines in the
> file isn't well defined: CR/LF marks end of line,
> but what happens if the CR and LF are scattered randomly?

Is this true with "modern" versions of DOS? FAT has a length-in-bytes
field in the directory entry.

Dave Brown
* Re: [Caml-list] Bug with really_input under cygwin
From: Richard Zidlicky @ 2004-03-10 13:14 UTC
To: David Brown; +Cc: skaller, Eric Dahlman, caml-list

On Tue, Mar 09, 2004 at 08:10:09PM -0800, David Brown wrote:
> On Wed, Mar 10, 2004 at 02:06:59PM +1100, skaller wrote:
>
> > In MS-DOS, files *always* consist of a number of 256
> > byte blocks. It is impossible to have a file with
> > a non-256 byte multiple size. Of course, text files
> > use an encoding with a Ctrl-Z at the end. So the length
> > of the file 'in bytes' is not the same as the length
> > of the file 'in Latin-1'. The number of lines in the
> > file isn't well defined: CR/LF marks end of line,
> > but what happens if the CR and LF are scattered randomly?
>
> Is this true with "modern" versions of DOS? FAT has a length-in-bytes
> field in the directory entry.

It was never true in DOS; it was in CP/M.

Richard
* Re: [Caml-list] Bug with really_input under cygwin
From: skaller @ 2004-03-11 4:11 UTC
To: Richard Zidlicky; +Cc: David Brown, skaller, Eric Dahlman, caml-list

On Thu, 2004-03-11 at 00:14, Richard Zidlicky wrote:
> > field in the directory entry.
>
> it was never true in DOS, it was in CP/M

yes, it was true in DOS 1 .. but then that was just a copy of CP/M :D

-- 
John Skaller
* Re: [Caml-list] Bug with really_input under cygwin
From: skaller @ 2004-03-11 3:24 UTC
To: David Brown; +Cc: skaller, Eric Dahlman, caml-list

On Wed, 2004-03-10 at 15:10, David Brown wrote:
> On Wed, Mar 10, 2004 at 02:06:59PM +1100, skaller wrote:
>
> > In MS-DOS, files *always* consist of a number of 256
> > byte blocks.
>
> Is this true with "modern" versions of DOS? FAT has a length-in-bytes
> field in the directory entry.

No, it's not true in Windows 3 style DOS which uses the newer
FAT, eg my Win 98 box: however the point was simpler.
Unix's idea that a file is a sequence of bytes is simply wrong.
The abstraction is convenient on the surface, but underneath
it's the wrong idea.

In fact the older IBM-DOS systems (I mean 360 machines not PCs :)
were and still are closer to reality: those systems needed macros
for every different kind of device. VSAM files were quite different
to Indexed Sequential disk files (which were supported directly
by HARDWARE disk operations).

The thing is, abstraction is a tricky game. There's no way
you can sensibly abstract a graphics interface to a file
concept for example. Nor a sound card, etc.

My point isn't to be critical, so much as to indicate
that when one *does* try to be too abstract, something is
sure to be lost. For example 'length of channel' simply
doesn't make sense on non-storage devices. How long
is a terminal file?

Worse, 'length of channel' doesn't really make sense
on storage devices either. The actual stored length
is indeterminate and irrelevant if the data is compressed.
And the 'length read by the client' could be equally
meaningless .. consider the output from a database
select * statement ...

In other words, the in_channel_length problem is really a
symptom of an intractable problem: we need abstraction,
but there is never really any good one for something
like storage devices or I/O devices, because underneath
they're different.

The only real solution is Standards, and they're not
so good either :D

-- 
John Skaller
* Re: [Caml-list] Bug with really_input under cygwin
From: Nuutti Kotivuori @ 2004-03-10 15:25 UTC
To: Eric Dahlman; +Cc: skaller, caml-list

skaller@users.sourceforge.net wrote:
> On Wed, 2004-03-10 at 09:30, Eric Dahlman wrote:
>> Howdy all,
>>
>> I have some code which reads a whole file in and returns it
>> as a string.

If you have a master's degree in reading in between the rant, you
probably picked out the right answer from the text below. But here it
is as a simple answer:

  Loop doing 'input' on the file, until 'input' returns zero.

'really_input' is of course nice and easy, but since you have no
really proper way of knowing how large the entire file is going to be
in the end, you need to make a decision with the buffer size anyway.

Binary or non-binary mode only affects the \r\n -> \n translation
while reading the file - and vice versa while writing.

> The only correct way to do this is to read a block at a time
> until you get a partial block.
>
> This is so EVEN in 'binary' mode, which is just another
> ill conceived Unix hack :-)

[...]

> It is unfortunate that C and Unix do not provide a coherent
> abstraction in this area. Even binary I/O is ill-conceived:

[...]

> C has been plagued by extremely ill considered functions.
> Even the basic IO operation is not correctly defined.

[...]

> There is no such thing as 'the number of characters
> in a file'. Perhaps there is a number of bytes in a file.

[...]

> In MS-DOS, files *always* consist of a number of 256
> byte blocks. It is impossible to have a file with
> a non-256 byte multiple size. Of course, text files
> use an encoding with a Ctrl-Z at the end.

[...]

> Under Linux, the Standard for text encoding is UTF-8.

[...]

> I personally believe the easiest way to work around this
> quagmire of malspecification is to
>
> (a) ONLY use 8 bit binary I/O
> (b) ALWAYS read and write bytes
>
> even if you're processing text. Never depend on the
> language or OS conversion functions, it's very unlikely
> they'll be right. Do all the conversions needed yourself.
> At least when you find a problem you're not handling
> correctly you can fix it.

Luckily not everybody sees the world as glum :-)

-- Naked
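
The "loop doing 'input' until it returns zero" recipe can be sketched as follows. This assumes a recent OCaml, where input fills a bytes value; the name slurp_channel and the 4096-byte chunk size are arbitrary choices for illustration:

```ocaml
(* Read everything remaining on a channel without asking for its
   length up front. input returns 0 only at end of file. *)
let slurp_channel (ic : in_channel) : string =
  let chunk = Bytes.create 4096 in
  let buf = Buffer.create 4096 in
  let rec loop () =
    let n = input ic chunk 0 (Bytes.length chunk) in
    if n > 0 then begin
      (* Keep only the n bytes actually read in this round. *)
      Buffer.add_subbytes buf chunk 0 n;
      loop ()
    end
  in
  loop ();
  Buffer.contents buf
```

The loop never consults in_channel_length, so a mismatch between the reported length and the bytes delivered cannot make it fail. Whether \r\n shows up in the result still depends on whether the channel was opened with open_in or open_in_bin.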
* Re: [Caml-list] Bug with really_input under cygwin
From: skaller @ 2004-03-11 3:42 UTC
To: Nuutti Kotivuori; +Cc: Eric Dahlman, skaller, caml-list

On Thu, 2004-03-11 at 02:25, Nuutti Kotivuori wrote:
> > even if you're processing text. Never depend on the
> > language or OS conversion functions, it's very unlikely
> > they'll be right. Do all the conversions needed yourself.
> > At least when you find a problem you're not handling
> > correctly you can fix it.
>
> Luckily not everybody sees the world as glum :-)

I'm not seeing it as glum. I'm pointing out that today
the situation is vastly more complex due to belated
recognition of the need for Standards to support
I18N issues.

Because of this the idea that \r\n <-> \n is the only
real encoding issue across platforms is wrong.
If only that were the case today, it would be a trivial
problem to resolve.

For example, text files may contain certain header bytes
that indicate if the file is UTF8 encoded, or UCS-2 with
big or little endian: these bytes if found must not be considered
as 'text', they're just encoding indicators.

Even within Unicode/ISO-10646 there are myriad 'encoding'
problems, the famous ones being the use of combining
characters -- and that's *after* you have found the
ISO10646 code points :)

So, if you want to handle *text* in a portable way,
you have some work ahead of you. Don't even try to
render it correctly, the required algorithm competes
with Mr Ackermann in performance :D

As long as these kinds of comments are labelled as 'rants'
people will continue to write non-portable software and
fail to face up to the issues.

-- 
John Skaller
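
The "header bytes" mentioned above are byte order marks. A minimal sketch of separating a mark from the text that follows it, assuming the input was read as raw bytes; the type bom, its constructors and split_bom are hypothetical names. It checks only the UTF-8 and 16-bit marks named in the message, so a UTF-32 little-endian file (which starts with the same two bytes as the 16-bit little-endian mark) would be misreported:

```ocaml
(* Which byte order mark, if any, was found at the start of the data. *)
type bom = Utf8 | Utf16_be | Utf16_le | No_bom

(* Return the detected mark and the remaining bytes with the mark
   removed, so the mark is never treated as text. *)
let split_bom (s : string) : bom * string =
  let n = String.length s in
  let drop k = String.sub s k (n - k) in
  if n >= 3 && s.[0] = '\xEF' && s.[1] = '\xBB' && s.[2] = '\xBF' then
    (Utf8, drop 3)
  else if n >= 2 && s.[0] = '\xFE' && s.[1] = '\xFF' then
    (Utf16_be, drop 2)
  else if n >= 2 && s.[0] = '\xFF' && s.[1] = '\xFE' then
    (Utf16_le, drop 2)
  else
    (No_bom, s)
```

Absence of a mark says nothing about the encoding; it only means the caller has to decide by other means.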
* Re: [Caml-list] Bug with really_input under cygwin
From: Nuutti Kotivuori @ 2004-03-11 5:02 UTC
To: skaller; +Cc: caml-list

skaller@users.sourceforge.net wrote:
> On Thu, 2004-03-11 at 02:25, Nuutti Kotivuori wrote:
>> Luckily not everybody sees the world as glum :-)
>
> I'm not seeing it as glum. I'm pointing out that today the situation
> is vastly more complex due to belated recognition of the need for
> Standards to support I18N issues.
>
> Because of this the idea that \r\n <-> \n is the only real encoding
> issue across platforms is wrong. If only that were the case today,
> it would be a trivial problem to resolve.
>
> For example, text files may contain certain header bytes that
> indicate if the file is UTF8 encoded, or UCS-2 with big or little
> endian: these bytes if found must not be considered as 'text',
> they're just encoding indicators.
>
> Even within Unicode/ISO-10646 there are myriad 'encoding' problems,
> the famous ones being the use of combining characters -- and that's
> *after* you have found the ISO10646 code points :)
>
> So, if you want to handle *text* in a portable way, you have some
> work ahead of you. Don't even try to render it correctly, the
> required algorithm competes with Mr Ackermann in performance :D
>
> As long as these kinds of comments are labelled as 'rants' people
> will continue to write non-portable software and fail to face up to
> the issues.

I have left the entire text here quoted to point out the difference in
subjects.

Sure, handling *text* is a really, really complex beast in today's
world. I end up fighting with those problems almost daily. You are
preaching to the choir.

But - there's nothing ambiguous about slurping an entire file into a
string. And there's nothing complex about doing that portably.
Encodings, byte-order-marks, combining characters, text printing and
all that do not enter into it. The \r\n <-> \n translation issue is
the first portability hurdle, since it affects plain byte input and
output, regardless of implications for text.

String as an array of characters is a really complex beast to
handle. String as an array of bytes is trivial to handle.

And the encoding issues do not suddenly make 'md5sum' any less
portable. Or 'rsync'. Or 'wget'. But the \r\n <-> \n issue does.

-- Naked
* Re: [Caml-list] Bug with really_input under cygwin
From: skaller @ 2004-03-11 15:21 UTC
To: Nuutti Kotivuori; +Cc: skaller, caml-list

On Thu, 2004-03-11 at 16:02, Nuutti Kotivuori wrote:
> skaller@users.sourceforge.net wrote:
>
> But - there's nothing ambiguous about slurping an entire file into a
> string. And there's nothing complex about doing that portably.

A file is an abstraction, meaning it is a collection of
access methods. There are no 'bytes' in a file, the bytes
are just values returned from the read function or
submitted to the write function.

Many files exist with many different access methods.
Even the same physical disk data can be different file
abstractions: eg Unix directory.

The data on an unbuffered non-blocking serial communication
link cannot be read with the block reading algorithm
I gave: end of data occurs before end of file -- if there
is an end of file .. eg it won't work on a raw terminal.

Yes, I know we have strayed a long way from the original
question, but I think the issue here is: the manual isn't
precise enough, but making it so is in fact very difficult,
so the Ocaml team is 'forgiven'. That's what I think I was
trying to say :D

-- 
John Skaller
* Re: [Caml-list] Bug with really_input under cygwin
From: james woodyatt @ 2004-03-11 6:32 UTC
To: The Trade; +Cc: skaller

On 10 Mar 2004, at 19:42, skaller wrote:
>
> As long as these kinds of comments are labelled as 'rants' people will
> continue to write non-portable software and fail to face up to the
> issues.

Can I get an "Amen!" brothers and sisters.

And while we are categorizing difficulties handling text that too many
people are unwilling to face realistically, can we bring up the
constellation of issues revolving around lexicographical comparisons?

Text comparison (sorting), parsing and matching can be a royal pain in
the rear end if you're trying to maintain localization capabilities.
We all desperately need to pay more attention to localization --
especially if we are committed to retaining our cultural affinity for
dead and dying natural languages in the digital era.

-- 
j h woodyatt <jhw@wetware.com>
markets are only free to the people who own them.
end of thread, newest message ~2004-03-11 15:17 UTC

Thread overview: 12+ messages
2004-03-09 22:30 [Caml-list] Bug with really_input under cygwin Eric Dahlman
2004-03-09 22:52 ` Karl Zilles
2004-03-10  3:06 ` skaller
2004-03-10  4:10   ` David Brown
2004-03-10 13:14     ` Richard Zidlicky
2004-03-11  4:11       ` skaller
2004-03-11  3:24     ` skaller
2004-03-10 15:25   ` Nuutti Kotivuori
2004-03-11  3:42     ` skaller
2004-03-11  5:02       ` Nuutti Kotivuori
2004-03-11 15:21         ` skaller
2004-03-11  6:32       ` james woodyatt