Mailing list for all users of the OCaml language and system.
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Diego Olivier Fernandez Pons <dofp.ocaml@gmail.com>
Cc: caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] SQL engine in OCaml with client side cache
Date: Sun, 29 Jan 2012 21:16:17 +0100	[thread overview]
Message-ID: <1327868178.19516.60.camel@samsung> (raw)
In-Reply-To: <CAHqiZ-LCQFKuuMVtL_WAbUOMVKfBg4vg45OCVZoSmryi5+Az6g@mail.gmail.com>

On Sunday, 29.01.2012, at 19:29 +0100, Diego Olivier Fernandez
Pons wrote:
> Caml-list,
> 
> [Gerd Stolpmann wrote]
> > I think this is not possible. SQL always needs access to the complete
> > table for executing queries (including the complete indexes).

"Not possible" in the sense of: yielding a result with performance
better than without the cache. Also, I was assuming the general case.

> I am surprised by your comments. Many systems have two-layer data
> storage (massive but slow one, fast but limited one), and that doesn't
> affect their semantics. 

The problem is how to define locality here. Imagine you have a table
with two indexes, and the indexes are pretty independent of each other
in terms of locality (i.e. when y is the direct successor of x in one
index, the distance between the two in the other index is large). How
do you order the rows of the table then? For efficient caching it is
essential that you can guess which other data items will be accessed
next. With multi-dimensional locality as in this example, though, that
guess becomes difficult.
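To make "independent indexes" concrete, here is a tiny OCaml sketch
with invented data: two rows that are direct neighbours in the id
index sit at opposite ends of the name index.

```ocaml
(* Rows of a toy table: (id, name). Two indexes order the same rows
   differently; neighbours in one ordering are far apart in the other. *)
let rows = [ (1, "zebra"); (2, "apple"); (3, "yak"); (4, "bear"); (5, "x-ray") ]

(* Position of a row in the ordering induced by a key extractor. *)
let position key_of rows row =
  let sorted = List.sort (fun a b -> compare (key_of a) (key_of b)) rows in
  let rec idx i = function
    | [] -> raise Not_found
    | r :: rest -> if r = row then i else idx (i + 1) rest
  in
  idx 0 sorted

let () =
  (* (1, "zebra") and (2, "apple") are adjacent in the id index,
     but far apart in the name index. *)
  Printf.printf "id index: %d %d; name index: %d %d\n"
    (position fst rows (1, "zebra")) (position fst rows (2, "apple"))
    (position snd rows (1, "zebra")) (position snd rows (2, "apple"))
```

Whichever physical row order you pick, one of the two indexes will
scatter its lookups all over the table.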

The SQL server "solves" this by deciding on a query strategy beforehand.
E.g. for an inner join involving two indexes it is usually possible to
iterate over one index, and then do lookups in the other one. This is
slow enough on a local disk - if you add more latency for the lookups in
the second index (because they are not in the cache), performance will
be even worse.
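That iterate-then-lookup strategy is an index nested-loop join. A
minimal OCaml sketch, with a Hashtbl standing in for the second index
(the tables and data are made up):

```ocaml
(* Index nested-loop join: scan the outer table in index order, and for
   each row probe the inner index. Every probe is a potential cache miss. *)
let orders = [ (1, "alice"); (2, "bob"); (3, "alice") ]   (* (order_id, customer) *)
let customers = [ ("alice", "Paris"); ("bob", "Lyon") ]   (* (customer, city) *)

let join () =
  (* "Index" on customers: name -> city. *)
  let idx = Hashtbl.create 16 in
  List.iter (fun (name, city) -> Hashtbl.replace idx name city) customers;
  (* Iterate the outer table, look up each row in the inner index. *)
  List.filter_map
    (fun (oid, name) ->
       match Hashtbl.find_opt idx name with
       | Some city -> Some (oid, name, city)
       | None -> None)
    orders

let () =
  List.iter (fun (oid, n, c) -> Printf.printf "%d %s %s\n" oid n c) (join ())
```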

Note that there are also join algorithms that really need the complete
tables, like sort-merge join.
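For contrast, sort-merge sorts both inputs on the join key and merges
them in lock-step - which is exactly why it wants both tables in full.
A sketch (assuming unique keys on each side):

```ocaml
(* Sort-merge join: sort both inputs on the join key, then merge.
   Needs both inputs completely before the result is final. *)
let merge_join l r =
  let l = List.sort compare l and r = List.sort compare r in
  let rec go acc l r =
    match l, r with
    | (ka, va) :: l', (kb, vb) :: r' ->
        if ka < kb then go acc l' r
        else if ka > kb then go acc l r'
        else go ((ka, va, vb) :: acc) l' r'  (* keys unique per side *)
    | _ -> List.rev acc
  in
  go [] l r

let () =
  let out = merge_join [ (2, "b"); (1, "a") ] [ (1, "x"); (3, "z") ] in
  assert (out = [ (1, "a", "x") ])
```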

> Consider a compiler (heap/ registers), an
> operating system (swap files, RAM) or even a typical SQL database that
> keeps tables in memory for fast access.

This works at the page level only (i.e. databases either rely on the
page cache of the OS, or implement their own page cache following a
similar design). So it is pretty low-level.

If you try to have a cache on a higher level, you run into the problem
of how to get the data via SQL statements. The following simple strategy
does not work: if a statement cannot be fully answered from the
cache, just send it to the server, and put the response rows into the
cache. The reason: you don't know which rows have been omitted, so you
could never answer range queries from the cache.
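A concrete OCaml sketch of that broken strategy (the data is
invented): the cache holds the rows of one earlier query, and a later
range query silently misses rows the server never sent.

```ocaml
(* The cache holds rows (key, value) returned by an earlier query; it
   cannot tell which keys in a range were omitted by the server. *)
let full_table = [ (10, "a"); (20, "b"); (30, "c"); (40, "d") ]

(* An earlier query "key >= 25" filled the cache with part of the data. *)
let cache = List.filter (fun (k, _) -> k >= 25) full_table

let range_query rows lo hi = List.filter (fun (k, _) -> lo <= k && k <= hi) rows

let () =
  (* Row (20, "b") is silently missing from the cached answer: the
     cache looks complete for [15, 35] but is not. *)
  Printf.printf "cache: %d rows, correct: %d rows\n"
    (List.length (range_query cache 15 35))
    (List.length (range_query full_table 15 35))
```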

So, to have a working cache you need quite direct access to the data,
e.g. "retrieve all data from this position to that position", both
for tables and indexes. There are SQL cursors, but, as far as I know,
they are not precise enough for this. I think this is probably solvable
if you know the data definition.
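What "from this position to that position" could look like against a
sorted index, sketched as a binary search over a sorted array (this is
the primitive the cache would need; plain SQL does not offer it):

```ocaml
(* A cache-friendly primitive: fetch an exact slice of index positions.
   With this, the client knows precisely which part of the key space
   it holds - unlike with an arbitrary SQL result set. *)
let index = [| 10; 20; 30; 40; 50 |]   (* sorted keys of an index *)

(* First position whose key is >= k (lower bound, binary search). *)
let lower_bound a k =
  let rec go lo hi =
    if lo >= hi then lo
    else
      let mid = (lo + hi) / 2 in
      if a.(mid) < k then go (mid + 1) hi else go lo mid
  in
  go 0 (Array.length a)

(* Fetch everything with key in [k1, k2] by position, not by predicate. *)
let fetch_range a k1 k2 =
  let lo = lower_bound a k1 and hi = lower_bound a (k2 + 1) in
  Array.sub a lo (hi - lo)

let () = assert (fetch_range index 15 40 = [| 20; 30; 40 |])
```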

> This is the same but the hard drive is on the server side, the RAM on
> the client side and they communicate by Internet.
> 
> I am not saying that implementing a generic cache / swapping system
> for an SQL database is easy.

I think if you solved the generic case, you'd be a candidate for the
Turing award.

(Just as a side note: lacking scalability has always been a problem for
SQL databases. Currently, many companies are moving away from this
technology for their giant data warehouses just because of this. Of
course, those workloads are read-write, not read-only, but even having
read-only scalability would be a big achievement.)

>  That's why I added that for this specific
> application I can easily compute a superset of the data the user will
> need. Let's take an example: say my database contains sales history
> data for a convenience store chain for all stores in France, all
> products for last 5 years. If my user is the Coca-Cola replenishment
> manager in Paris, he only needs sales of Coca-Cola products for the
> last 5 similar days for each store in Paris. Thereafter I can generate
> a query on the server that sends that superset of data to the client
> and let the SQL client engine work on that.

This approach could indeed work. Basically, you define the locality
directly.
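Defining the locality directly could be as simple as deriving one
superset query from the user's context. A sketch - the record fields,
table and column names are all invented:

```ocaml
(* Derive a superset query from the client's context: everything the
   Coca-Cola manager in Paris could possibly look at, fetched in one shot. *)
type context = { brand : string; city : string; days : int }

let superset_query ctx =
  Printf.sprintf
    "SELECT * FROM sales WHERE brand = '%s' AND city = '%s' \
     AND day >= CURRENT_DATE - %d"
    ctx.brand ctx.city ctx.days
(* NB: real code must use bound parameters, not string splicing. *)

let () =
  print_endline (superset_query { brand = "Coca-Cola"; city = "Paris"; days = 35 })
```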

> I may have underestimated the amount of work to be done, because I
> thought this would be on the easier side.
> My idea was the following
> 
> Server side
> - 3rd-party SQL database
> - OCaml bindings
> - Web communication interface
> 
> Client side
> - SQL parser written in OCaml + interpreter
> - control system to guess superset of data and request from server
> - Web communication interface
> - output to JavaScript widget toolkit
> 
> I was expecting to find most of the pieces already done and only
> having to glue them. I was actually more afraid of the JavaScript part
> than the SQL one... You can even find SQL engines written in
> JavaScript out there, to run in a web browser like Jade
> http://jacwright.com/489/in-memory-javascript-database/ or JOrder
> (JSON) https://github.com/danstocker/jorder
> 
> In the end, that's nothing but arrays and for loops.

And an ugly parser :-(

If you can define the superset in a way that depends only on the
client, but not on the client's queries, an option would be to use an
existing in-memory SQL database as the cache. SQLite has such a mode
(use ":memory:" as the filename), and AFAIK MySQL has one, too.
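Sketched with the sqlite3-ocaml bindings (opam package "sqlite3"); the
table and data are invented:

```ocaml
(* Use SQLite's in-memory mode (":memory:") as the client-side cache:
   load the superset once, then answer SQL locally. *)
let db = Sqlite3.db_open ":memory:"

let () =
  (* Fill the cache with the superset fetched from the server. *)
  ignore (Sqlite3.exec db "CREATE TABLE sales (brand TEXT, qty INT)");
  ignore (Sqlite3.exec db "INSERT INTO sales VALUES ('Coca-Cola', 7)");
  ignore (Sqlite3.exec db "INSERT INTO sales VALUES ('Coca-Cola', 5)")

(* The client can now run arbitrary SQL against the cached data. *)
let total_qty () =
  let total = ref 0 in
  let cb row _headers =
    match row.(0) with Some s -> total := int_of_string s | None -> ()
  in
  ignore (Sqlite3.exec db ~cb "SELECT SUM(qty) FROM sales");
  !total

let () = Printf.printf "cached total = %d\n" (total_qty ())
```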

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------



Thread overview: 9+ messages
2012-01-29  9:56 Diego Olivier Fernandez Pons
2012-01-29 11:15 ` Gerd Stolpmann
2012-01-29 18:29   ` Diego Olivier Fernandez Pons
2012-01-29 20:16     ` Gerd Stolpmann [this message]
2012-01-29 23:26       ` Diego Olivier Fernandez Pons
2012-01-30  9:12         ` Gabriel Scherer
2012-01-31 15:04         ` Gerd Stolpmann
2012-01-29 11:39 ` Marc Weber
2012-01-29 13:42 ` Daniel Bünzli
