From: Armaël Guéneau
Date: Fri, 16 Aug 2013 12:53:16 +0200
To: caml-list@inria.fr
Subject: Re: [Caml-list] ANN: CamlPDF 1.7

Hi,

On 15/08/2013 13:21, John Whitington wrote:
> The first new release of the CamlPDF library for a while is here:
>
> http://www.github.com/johnwhitington/camlpdf
>
> (Or, shortly, via OPAM.)

Thanks!

I have been playing with CamlPDF a bit, trying to do text extraction.
I'm a total novice with the PDF format, so I might be doing it wrong,
but I was wondering whether CamlPDF has facilities to handle
diacritics and ligatures.

For example, when reading the PDF operators for "Université", I get

  Pdfops_TJ (Pdf.Array [Pdf.String "Universit"; Pdf.String "\019"; Pdf.Real 486.; Pdf.String "e"])

For "efficient", with "ffi" being ligated, I get

  Pdfops_TJ (Pdf.Array [Pdf.String "e\014cient"])

How can I convert these back to readable text, especially the ligature?
I tried the conversion functions in Pdftext, such as codepoints_of_text
followed by utf8_of_codepoints, but that didn't seem to work. It's quite
possible that I'm using them incorrectly.
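
To make the question concrete, here is a minimal, self-contained sketch of the kind of conversion being asked about. It does not use CamlPDF. The `obj` type is a hypothetical stand-in for the `Pdf.String` and `Pdf.Real` constructors shown above, and `glyph_of_byte` and `utf8_of_tj` are illustrative names only. The byte table assumes the font follows TeX's OT1 layout, where byte 14 is the "ffi" ligature and byte 19 is a standalone acute accent. That matches the two examples but is only a guess; a real extractor would presumably consult the font's encoding or ToUnicode map rather than a hard-coded table.

```ocaml
(* Hypothetical stand-in for the objects found inside a TJ array. *)
type obj =
  | String of string  (* a run of font-specific character codes *)
  | Real of float     (* a kerning adjustment, carries no text *)

type glyph =
  | Chars of int list  (* Unicode code points to emit directly *)
  | Accent of int      (* combining mark to attach to the next base character *)

(* Assumed OT1-style table; only the two codes from the examples are
   special-cased, every other byte is taken as its own code point. *)
let glyph_of_byte c =
  match Char.code c with
  | 14 -> Chars [0x66; 0x66; 0x69]  (* "ffi" ligature expanded to f, f, i *)
  | 19 -> Accent 0x0301             (* acute accent as COMBINING ACUTE ACCENT *)
  | n -> Chars [n]

(* Concatenate the strings of a TJ array into UTF-8, skipping kerning numbers.
   The accent precedes its base letter in the PDF, whereas Unicode wants the
   combining mark after it, so the accent is held back until the next
   character has been written. *)
let utf8_of_tj (items : obj list) : string =
  let buf = Buffer.create 16 in
  let add cp = Buffer.add_utf_8_uchar buf (Uchar.of_int cp) in
  let pending = ref None in
  let handle c =
    match glyph_of_byte c with
    | Accent a -> pending := Some a
    | Chars cps ->
      List.iter add cps;
      (match !pending with
       | Some a -> add a; pending := None
       | None -> ())
  in
  List.iter
    (function
      | String s -> String.iter handle s
      | Real _ -> ())
    items;
  Buffer.contents buf

let () =
  print_endline
    (utf8_of_tj [String "Universit"; String "\019"; Real 486.; String "e"]);
  print_endline (utf8_of_tj [String "e\014cient"])
```

Under these assumptions the first call yields "Universite" followed by a combining acute accent (so it displays as "Université" in decomposed form), and the second yields "efficient".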

Armaël