* [1/2 OT] Indexing (and mergeable Index-algorithms) @ 2005-11-16 23:42 Oliver Bandel 2005-11-17 8:15 ` [Caml-list] " skaller ` (4 more replies) 0 siblings, 5 replies; 30+ messages in thread From: Oliver Bandel @ 2005-11-16 23:42 UTC (permalink / raw) To: caml-list Hello, I'm looking for indexing algorithms and especially - if such a thing exists - mergeable/extendable indexing algorithms. So, say we have 10^6 texts that we want ot have an index for, to retrieve the text according to some parts of the text (keywords, substrings,...). We want to make an index from these texts. After a while we get 10^5 new texts and want to extend the exisiting index, so that the whole index not necessarily must be created again, with the indexer-tool running on all files (^10^6 + 10^5) again, but only have to index the new files, but the big index can be extended with additional smaller indizes. Is there something like that already existing? Or must the new index be created on all files again, or must there be a workaround with the big and a small index-file, where handling of both would be a solution we must provide by ourselfes? It's mainly a question on datastructures/algorithms, so this mailing list may be the wrong, but the reason to aske here is: Are functional datastructures in some way good for implementing such tools? BTW: Let's mention that the application I intended to write is performance critical.... so, if functional datastructures are quite good for such extendable indexes, but are too slow, then thsi would also be a problem. Any hints here? (Maybe using OCaml, but the imperative features of it would help, if the functional features would be too slow?) Any hint on algorithms/datastructures for this would be fine... Thanks In Advance, Oliver ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-16 23:42 [1/2 OT] Indexing (and mergeable Index-algorithms) Oliver Bandel @ 2005-11-17 8:15 ` skaller 2005-11-17 15:09 ` Brian Hurt 2005-11-17 8:35 ` Florian Hars ` (3 subsequent siblings) 4 siblings, 1 reply; 30+ messages in thread From: skaller @ 2005-11-17 8:15 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list On Thu, 2005-11-17 at 00:42 +0100, Oliver Bandel wrote: > So, say we have 10^6 texts that we want ot have an index for, > to retrieve the text according to some parts of the text > (keywords, substrings,...). > We want to make an index from these texts. BTree or modern equivalent may be an option. NTFS directories are BTrees. Balanced BTree guarantees extremely small constant time lookup for all keys. (typically 3 or 4 accesses) Many updates are also fast constant time. However the Btrees are very expensive to rebalance, and occasionally an update requires a global rebalancing which brings the world to a complete stop for a very long time. BTree is ideal for large data structures, since the nodes are chosen to match up with disk block sizes, and the underlying I/O is done with low level I/O calls, that is, BTree is excellent for hierarchical storage media (such as virtual memory). One system I worked on used BTrees with spill list for those updates which would require a rebalancing. The merge/rebalancing is then done whilst the tree is offline, accesses are slowed down by the spill list until the rebalancing is done. Btree is probably good for a search engine where you run two trees on two servers and do the rebalancing on the offline one (then swap servers). Some modern variants amortise the costs differently, typically reducing the worst case at the expense of the other operations. In particular, the very worst way to populate a BTree from a list is if the list is already sorted, however a smart algorithm can build an empty skeleton first and populate it in linear time (provided only you know how many keys there are). Obviously, the tree has to be offline until this operation is completed. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 8:15 ` [Caml-list] " skaller @ 2005-11-17 15:09 ` Brian Hurt 2005-11-17 17:31 ` skaller 0 siblings, 1 reply; 30+ messages in thread From: Brian Hurt @ 2005-11-17 15:09 UTC (permalink / raw) To: skaller; +Cc: Oliver Bandel, caml-list On Thu, 17 Nov 2005, skaller wrote: > Balanced BTree guarantees extremely small constant time lookup > for all keys. (typically 3 or 4 accesses) > Many updates are also fast constant time. However > the Btrees are very expensive to rebalance, and occasionally > an update requires a global rebalancing which brings the > world to a complete stop for a very long time. Not as I understand BTrees. BTrees are generializations of 2,3,4 trees. Each "node" of the tree has between k/2 and k (for k > 3) children (except for the root node, which can have anywhere from 2 to k children). When adding a new child to a given node, if the number of children exceeds k, then the node is split into two nodes, each with (k+1)/2 children (if k is even, one of the two nodes gets the extra). The new sibling is then added to the parent of the original node. When the root node is split, then a new level is added above the root node (note, changing the depth of the entire tree at the same time). Likewise, when children are removed, if the node falls below k/2 children, it's merged with one of it's siblings. If the root node drops to only having 1 child, it's removed, and it's lone child becomes the new root, again changing the depth of the entire tree at once. For in-memory data structures, Btrees are less efficient than standard balanced binary trees. See, the problem is that you do a search on the BTree, you have to do a binary search on the children at each node. So the cost of doing a search on a BTree with N elements is log base k of N nodes times log base 2 of k for the binary search in each node. A little bit of algebra proves that this is equal to the log base 2 of N, the same cost as searching a binary tree (basically). Worse yet, when inserting or deleting an object, the average cost is O(k)- you have to insert or remove elements into/from the middle of an array. If k > log base 2 of N (likely), then the standard balanced binary tree will be faster on inserts and deletes. Where BTrees shine is in disk based data structures. Here, the main bottleneck is reading data off the disk. For N=2^32 and k=256, a standard balanced binary tree would require up to 64 disk reads, the BTree only 5. > Some modern variants amortise the costs differently, > typically reducing the worst case at the expense > of the other operations. In particular, the very > worst way to populate a BTree from a list is if > the list is already sorted, however a smart algorithm > can build an empty skeleton first and populate it > in linear time (provided only you know > how many keys there are). Obviously, the tree > has to be offline until this operation is completed. Actually, if the data is already sorted, creating the tree should be O(N) cost. At each level, you're adding children to the same node. You keep adding children until you hit a limit (I'd suggest 3k/4 as a good limit). When one node is full, you create a new node (adding it to the next layer up) and start adding elements to it. When you're done adding children, if the current node you're adding to has less than k/2 nodes, it gets merged with it's previous sibling. Proving that this is O(N) total cost is left as an exercise to the reader. Brian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 15:09 ` Brian Hurt @ 2005-11-17 17:31 ` skaller 2005-11-17 18:08 ` Brian Hurt 0 siblings, 1 reply; 30+ messages in thread From: skaller @ 2005-11-17 17:31 UTC (permalink / raw) To: Brian Hurt; +Cc: Oliver Bandel, caml-list On Thu, 2005-11-17 at 09:09 -0600, Brian Hurt wrote: > > On Thu, 17 Nov 2005, skaller wrote: > > > Balanced BTree guarantees extremely small constant time lookup > > for all keys. (typically 3 or 4 accesses) > > Many updates are also fast constant time. However > > the Btrees are very expensive to rebalance, and occasionally > > an update requires a global rebalancing which brings the > > world to a complete stop for a very long time. > > Not as I understand BTrees. I'm not sure what it is we disagree on. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 17:31 ` skaller @ 2005-11-17 18:08 ` Brian Hurt 2005-11-17 18:57 ` skaller 0 siblings, 1 reply; 30+ messages in thread From: Brian Hurt @ 2005-11-17 18:08 UTC (permalink / raw) To: skaller; +Cc: Oliver Bandel, caml-list On Fri, 18 Nov 2005, skaller wrote: > On Thu, 2005-11-17 at 09:09 -0600, Brian Hurt wrote: >> >> On Thu, 17 Nov 2005, skaller wrote: >> >>> Balanced BTree guarantees extremely small constant time lookup >>> for all keys. (typically 3 or 4 accesses) >>> Many updates are also fast constant time. However >>> the Btrees are very expensive to rebalance, and occasionally >>> an update requires a global rebalancing which brings the >>> world to a complete stop for a very long time. >> >> Not as I understand BTrees. > > I'm not sure what it is we disagree on. They don't ever need global rebalancing. Since levels are only added or removed at the top, the distance between the root and any leaf is the same as any other leaf, at all times. Brian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 18:08 ` Brian Hurt @ 2005-11-17 18:57 ` skaller 2005-11-17 22:15 ` Brian Hurt 0 siblings, 1 reply; 30+ messages in thread From: skaller @ 2005-11-17 18:57 UTC (permalink / raw) To: Brian Hurt; +Cc: Oliver Bandel, caml-list On Thu, 2005-11-17 at 12:08 -0600, Brian Hurt wrote: > > I'm not sure what it is we disagree on. > > They don't ever need global rebalancing. Yup, they do not need it to maintain the invariant you stated. But performance can still benefit from it significantly. Two quite distinct BTrees can contain the same data exactly, it depends on the order of insertion of keys. If you are clever, you can fill up a BTree so every block is exactly full .. however you don't need to be clever to get the worst possible BTree -- just fill an empty tree with sorted data and almost all the blocks are guaranteed to be half full (the worst possible case for storage use and access time). Yet, this is a common case in practice. ** all the blocks will be half full except those on the right edge of the tree -- for a tree of depth 5 that's millions of half full blocks, and only 5 that can possibly be more than half full :) And in between, blocks can be filled to different extents. Rebalancing adjusts this. The basic algorithm you describe has MANY tweaks which attempt to rebalance the tree. The version I played with did 'scrolling', where instead of splitting an overfull block, you scroll a key (and child) across to one of its siblings via the parent. This is quite tricky .. but it fixes the sorted insertion problem nicely -- with this modification, all the blocks are always FULL .. except those on the right most edge and their left siblings: they grow until they're both full, then the rightmost one (only) is split. So at worst, you waste 5 blocks out of millions.. basically this doubles the capacity of the tree (built from a sorted list). The same tweak can be used on random insertions the same way, however the extra reads and write may or may not be worth it, depending on how valuable your disk storage is (etc). -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 18:57 ` skaller @ 2005-11-17 22:15 ` Brian Hurt 2005-11-18 1:49 ` skaller 0 siblings, 1 reply; 30+ messages in thread From: Brian Hurt @ 2005-11-17 22:15 UTC (permalink / raw) To: skaller; +Cc: Oliver Bandel, caml-list On Fri, 18 Nov 2005, skaller wrote: > On Thu, 2005-11-17 at 12:08 -0600, Brian Hurt wrote: > >>> I'm not sure what it is we disagree on. >> >> They don't ever need global rebalancing. > > Yup, they do not need it to maintain the invariant > you stated. But performance can still benefit from it > significantly. > > Two quite distinct BTrees can contain the same data exactly, > it depends on the order of insertion of keys. If you are > clever, you can fill up a BTree so every block is exactly > full .. however you don't need to be clever to get the worst > possible BTree -- just fill an empty tree with sorted > data and almost all the blocks are guaranteed to be half > full (the worst possible case for storage use and > access time). Yet, this is a common case in practice. > ** all the blocks will be half full except those on > the right edge of the tree -- for a tree of depth 5 > that's millions of half full blocks, and only 5 that > can possibly be more than half full :) This is the worst possible case- that each block is half full. Which means that instead of log_k(N) blocks, you're having to touch log_{k/2}(N) blocks. This means that if N=2^32 and k=256, that you need to read 5 blocks instead of 4 (128^5 = 2^35). And the number of blocks you need has about doubled. Also note that the binary search per block is now cheaper (by one step), and the cost of inserting elements is half. So the question becomes: is the performance advantage gained by rebalancing worth the cost? If I was worried about it, I'd be inclined to be more agressive on merging and splitting nodes. Basically, if the node is under 5/8th full, I'd look to steal some children from siblings. If the node is over 7/8th full, I'd look to share some child with siblings. Note that if you have three nodes each 1/2 full, you can combine the three into two nodes, each 3/4th full. You want to keep nodes about 3/4th full, as that makes it cheaper to add and delete elements. > The version I played with did 'scrolling', where > instead of splitting an overfull block, you scroll > a key (and child) across to one of its siblings > via the parent. This is quite tricky .. but it > fixes the sorted insertion problem nicely -- > with this modification, all the blocks are > always FULL .. except those on the right most > edge and their left siblings: they grow until > they're both full, then the rightmost one (only) > is split. So at worst, you waste 5 blocks > out of millions.. basically this doubles the > capacity of the tree (built from a sorted list). Two problems with this: first, what happens when the sibling is full too, you can get into a case where an insert is O(N) cost, and second, this is assuming inserts only (I can still get to worst-case with deletes). Brian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 22:15 ` Brian Hurt @ 2005-11-18 1:49 ` skaller 0 siblings, 0 replies; 30+ messages in thread From: skaller @ 2005-11-18 1:49 UTC (permalink / raw) To: Brian Hurt; +Cc: Oliver Bandel, caml-list On Thu, 2005-11-17 at 16:15 -0600, Brian Hurt wrote: > > This is the worst possible case- that each block is half full. Which > means that instead of log_k(N) blocks, you're having to touch log_{k/2}(N) > blocks. This means that if N=2^32 and k=256, that you need to read 5 > blocks instead of 4 (128^5 = 2^35). And the number of blocks you need has > about doubled. Also note that the binary search per block is now cheaper > (by one step), and the cost of inserting elements is half. > > So the question becomes: is the performance advantage gained by > rebalancing worth the cost? Yes, that's the question. And there is no single answer :) Note, it is not 5 reads instead of 4, it is 3 reads instead of 2 (assuming the first two levels are cached). A BTree system I used once was fixed at 3 levels. So it could be kind of critical :) > If I was worried about it, I'd be inclined to be more agressive on merging > and splitting nodes. Basically, if the node is under 5/8th full, I'd look > to steal some children from siblings. If the node is over 7/8th full, I'd > look to share some child with siblings. Note that if you have three nodes > each 1/2 full, you can combine the three into two nodes, each 3/4th full. > You want to keep nodes about 3/4th full, as that makes it cheaper to add > and delete elements. Yup. There are lots of possible tweaks :) > Two problems with this: first, what happens when the sibling is full too, > you can get into a case where an insert is O(N) cost, and second, this is > assuming inserts only (I can still get to worst-case with deletes). Depends precisely on the algorithm -- mine only looked once. If the sibling was full, you just split as usual. Its a cheap hack :) -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-16 23:42 [1/2 OT] Indexing (and mergeable Index-algorithms) Oliver Bandel 2005-11-17 8:15 ` [Caml-list] " skaller @ 2005-11-17 8:35 ` Florian Hars 2005-11-17 9:24 ` Oliver Bandel 2005-11-17 11:49 ` Florian Weimer ` (2 subsequent siblings) 4 siblings, 1 reply; 30+ messages in thread From: Florian Hars @ 2005-11-17 8:35 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list Oliver Bandel wrote: > It's mainly a question on datastructures/algorithms I tend to try to find answers to such questions on CiteSeer, maybe you could start at the first paper I found with a quick search: http://citeseer.ist.psu.edu/cutting90optimizations.html and then look at the papers citing it, or similar to it. (WARNING: Excessive use of CiteSeer may lead to addiction.) Yours, Florian. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 8:35 ` Florian Hars @ 2005-11-17 9:24 ` Oliver Bandel 2005-11-17 12:39 ` Florian Weimer 0 siblings, 1 reply; 30+ messages in thread From: Oliver Bandel @ 2005-11-17 9:24 UTC (permalink / raw) To: caml-list On Thu, Nov 17, 2005 at 09:35:58AM +0100, Florian Hars wrote: > Oliver Bandel wrote: > >It's mainly a question on datastructures/algorithms > > I tend to try to find answers to such questions on CiteSeer, maybe you > could start at the first paper I found with a quick search: > http://citeseer.ist.psu.edu/cutting90optimizations.html > and then look at the papers citing it, or similar to it. well, thats, where my further search directed me to. :) I found an interesting paper there, about using updatable indexing ("Fast Incremental Indexing for Full-Text Information Retrieval" from Brown/Callen/Croft.) They talked about "inverted lists", and this together with other hints from this list may be a good starter. > > (WARNING: Excessive use of CiteSeer may lead to addiction.) Yes, that's true. I was an addict and hope to get clean, but now google (and you too) directed me back... ;-) Ciao, Oliver ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 9:24 ` Oliver Bandel @ 2005-11-17 12:39 ` Florian Weimer 2005-11-17 20:57 ` Oliver Bandel 0 siblings, 1 reply; 30+ messages in thread From: Florian Weimer @ 2005-11-17 12:39 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list * Oliver Bandel: > I found an interesting paper there, about using updatable indexing > ("Fast Incremental Indexing for Full-Text Information Retrieval" > from Brown/Callen/Croft.) They talked about "inverted lists", and > this together with other hints from this list may be a good starter. If you need a full-text search capability on natural language documents, there are various libraries you could reuse: Xapian, Lucene (although this is one is writtein Java, IIRC), Estraier, OpenFTS, a MySQL component, and probably many more. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 12:39 ` Florian Weimer @ 2005-11-17 20:57 ` Oliver Bandel 2005-11-17 22:02 ` Florian Weimer 0 siblings, 1 reply; 30+ messages in thread From: Oliver Bandel @ 2005-11-17 20:57 UTC (permalink / raw) To: caml-list On Thu, Nov 17, 2005 at 01:39:43PM +0100, Florian Weimer wrote: > * Oliver Bandel: > > > I found an interesting paper there, about using updatable indexing > > ("Fast Incremental Indexing for Full-Text Information Retrieval" > > from Brown/Callen/Croft.) They talked about "inverted lists", and > > this together with other hints from this list may be a good starter. > > If you need a full-text search capability on natural language > documents, there are various libraries you could reuse: Xapian, Lucene > (although this is one is writtein Java, IIRC), Estraier, OpenFTS, a > MySQL component, and probably many more. constraint: have to use PostgreSQL. Does it have fulltext search? Ciao, Oliver ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 20:57 ` Oliver Bandel @ 2005-11-17 22:02 ` Florian Weimer 0 siblings, 0 replies; 30+ messages in thread From: Florian Weimer @ 2005-11-17 22:02 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list * Oliver Bandel: >> If you need a full-text search capability on natural language >> documents, there are various libraries you could reuse: Xapian, Lucene >> (although this is one is writtein Java, IIRC), Estraier, OpenFTS, a >> MySQL component, and probably many more. > > constraint: have to use PostgreSQL. > > Does it have fulltext search? OpenFTS is based on PostgreSQL and its tsearch2 contrib module. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-16 23:42 [1/2 OT] Indexing (and mergeable Index-algorithms) Oliver Bandel 2005-11-17 8:15 ` [Caml-list] " skaller 2005-11-17 8:35 ` Florian Hars @ 2005-11-17 11:49 ` Florian Weimer 2005-11-17 13:55 ` Richard Jones 2005-11-18 14:54 ` Jonathan Bryant [not found] ` <437CD0E5.8080503@yahoo.fr> [not found] ` <437BD5F5.6010307@1969web.com> 4 siblings, 2 replies; 30+ messages in thread From: Florian Weimer @ 2005-11-17 11:49 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list * Oliver Bandel: > So, say we have 10^6 texts that we want ot have an index for, > to retrieve the text according to some parts of the text > (keywords, substrings,...). > We want to make an index from these texts. B-trees are often used for that purpose. > After a while we get 10^5 new texts and want to extend > the exisiting index, so that the whole index not necessarily > must be created again, with the indexer-tool running on > all files (^10^6 + 10^5) again, but only have to index the new files, > but the big index can be extended with additional smaller indizes. 10^6 documents won't fit into available RAM. What's worse, you cannot afford to reindex everything just because the machine crashed. This means that you need some form of crash recovery (e.g. write-ahead logging). > Is there something like that already existing? Plenty. Berkeley DB, SQLite, full-blown SQL database servers like PostgreSQL or MySQL. The list is pretty long. > It's mainly a question on datastructures/algorithms, so this mailing list > may be the wrong, but the reason to aske here is: Are functional > datastructures in some way good for implementing such tools? They are called "log-structured databases". Typically, they offer superior write performance (especially if you can waste a lot of disk space and delay garbage collection), but read access cannot match performance of more traditional approaches (like B-trees plus write-ahead logging). > (Maybe using OCaml, but the imperative features of it would help, > if the functional features would be too slow?) If I were you, I'd reuse existing libraries (not written in Objective Caml), so this question wouldn't be relevant. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 11:49 ` Florian Weimer @ 2005-11-17 13:55 ` Richard Jones 2005-11-18 14:54 ` Jonathan Bryant 1 sibling, 0 replies; 30+ messages in thread From: Richard Jones @ 2005-11-17 13:55 UTC (permalink / raw) To: caml-list On Thu, Nov 17, 2005 at 12:49:55PM +0100, Florian Weimer wrote: > Plenty. Berkeley DB, SQLite, full-blown SQL database servers like > PostgreSQL or MySQL. The list is pretty long. We use PostgreSQL's tsearch2[1] module to index web pages across our main site and customer sites. Today we have 38,437 pages including old versions in the index. Pros: * Extremely easy to use - you just insert pages as rows in the database. * Very featureful - does stemming, multiple language support, etc. * Works from OCaml using, eg., ocamldbi, OCaml-PostgreSQL module. Cons: * Quite hard to install - you need to read the documentation carefully. * Slow for lookups - I haven't quite got to the bottom of this so I don't know if it's inherently slow or if I haven't set up the indexes right. Rich. [1] http://www.sai.msu.su/~megera/oddmuse/index.cgi/tsearch-v2-intro -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 11:49 ` Florian Weimer 2005-11-17 13:55 ` Richard Jones @ 2005-11-18 14:54 ` Jonathan Bryant 2005-11-18 14:22 ` Oliver Bandel 2005-11-18 14:45 ` Florian Weimer 1 sibling, 2 replies; 30+ messages in thread From: Jonathan Bryant @ 2005-11-18 14:54 UTC (permalink / raw) To: caml-list On Thu, 2005-11-17 at 06:49, Florian Weimer wrote: > Plenty. Berkeley DB, SQLite, full-blown SQL database servers like > PostgreSQL or MySQL. The list is pretty long. > After a quick search that only returned one dead project (Alpha, v0.0.1, last updated 2002), I have to ask: are there any OCaml bindings for BerkeleyDB? --Jonathan Bryant jtbryant@valdosta.edu Student Intern Unix System Operations VSU Information Technology "Das Leben ohne Music ist einfach ein Irrtum, eine Strapaze, ein" Exil." ("Life without music is simply an error, a pain, an exile.") --Frederich Nietzsche "The three cardinal values of a programmer are laziness, impatience, and hubris." --Perl Man Page ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 14:54 ` Jonathan Bryant @ 2005-11-18 14:22 ` Oliver Bandel 2005-11-18 14:37 ` Florian Weimer 2005-11-18 14:45 ` Florian Weimer 1 sibling, 1 reply; 30+ messages in thread From: Oliver Bandel @ 2005-11-18 14:22 UTC (permalink / raw) To: caml-list On Fri, Nov 18, 2005 at 09:54:09AM -0500, Jonathan Bryant wrote: > On Thu, 2005-11-17 at 06:49, Florian Weimer wrote: > > > Plenty. Berkeley DB, SQLite, full-blown SQL database servers like > > PostgreSQL or MySQL. The list is pretty long. > > > > After a quick search that only returned one dead project (Alpha, v0.0.1, > last updated 2002), I have to ask: are there any OCaml bindings for > BerkeleyDB? Isn't Dbm-module that thing? IMHO it is. But there only string -> string can be used. So your key as well as value has to be string-type. Ciao, Oliver ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 14:22 ` Oliver Bandel @ 2005-11-18 14:37 ` Florian Weimer 2005-11-18 15:05 ` Thomas Fischbacher 2005-11-18 16:13 ` Oliver Bandel 0 siblings, 2 replies; 30+ messages in thread From: Florian Weimer @ 2005-11-18 14:37 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list * Oliver Bandel: >> After a quick search that only returned one dead project (Alpha, v0.0.1, >> last updated 2002), I have to ask: are there any OCaml bindings for >> BerkeleyDB? > > Isn't Dbm-module that thing? No, it isn't. Berkeley DB offers much richer functionality, including transactions and replication. > But there only string -> string can be used. > So your key as well as value has to be string-type. You have to use some serialization/deserialization code. The same issue arises if you use an SQL database. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 14:37 ` Florian Weimer @ 2005-11-18 15:05 ` Thomas Fischbacher 2005-11-18 15:14 ` Florian Weimer 2005-11-18 16:13 ` Oliver Bandel 1 sibling, 1 reply; 30+ messages in thread From: Thomas Fischbacher @ 2005-11-18 15:05 UTC (permalink / raw) To: Florian Weimer; +Cc: Oliver Bandel, caml-list On Fri, 18 Nov 2005, Florian Weimer wrote: > >> After a quick search that only returned one dead project (Alpha, v0.0.1, > >> last updated 2002), I have to ask: are there any OCaml bindings for > >> BerkeleyDB? > > > > Isn't Dbm-module that thing? > > No, it isn't. Berkeley DB offers much richer functionality, including > transactions and replication. However, crashes at unfortunate moments may lead to database corruption, plus the license is a bit strange. If you can live with that, however, it's a pretty neat thing. -- regards, tf@cip.physik.uni-muenchen.de (o_ Thomas Fischbacher - http://www.cip.physik.uni-muenchen.de/~tf //\ (lambda (n) ((lambda (p q r) (p p q r)) (lambda (g x y) V_/_ (if (= x 0) y (g g (- x 1) (* x y)))) n 1)) (Debian GNU) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 15:05 ` Thomas Fischbacher @ 2005-11-18 15:14 ` Florian Weimer 2005-11-18 16:03 ` Thomas Fischbacher 2005-11-18 20:01 ` Gerd Stolpmann 0 siblings, 2 replies; 30+ messages in thread From: Florian Weimer @ 2005-11-18 15:14 UTC (permalink / raw) To: Thomas Fischbacher; +Cc: Oliver Bandel, caml-list * Thomas Fischbacher: >> No, it isn't. Berkeley DB offers much richer functionality, including >> transactions and replication. > > However, crashes at unfortunate moments may lead to database > corruption, All my crashes could be traced back to operator error and faulty hardware. Berkeley DB has a very bad reputation when it comes to data integrity and durability, but in my experience, this is simply not true. > plus the license is a bit strange. It's a very simple, GPL-compatible copyleft license. Pretty standard. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 15:14 ` Florian Weimer @ 2005-11-18 16:03 ` Thomas Fischbacher 2005-11-18 20:03 ` Gerd Stolpmann 2005-11-18 20:01 ` Gerd Stolpmann 1 sibling, 1 reply; 30+ messages in thread From: Thomas Fischbacher @ 2005-11-18 16:03 UTC (permalink / raw) To: Florian Weimer; +Cc: Oliver Bandel, caml-list On Fri, 18 Nov 2005, Florian Weimer wrote: > >> No, it isn't. Berkeley DB offers much richer functionality, including > >> transactions and replication. > > > > However, crashes at unfortunate moments may lead to database > > corruption, > > All my crashes could be traced back to operator error and faulty > hardware. Berkeley DB has a very bad reputation when it comes to data > integrity and durability, but in my experience, this is simply not > true. I experienced data loss in the past - twice in eight years of running it under considerable load. Both times, we had good recent backups, so this was not a big issue. > > plus the license is a bit strange. > > It's a very simple, GPL-compatible copyleft license. Pretty standard. Hm, I just had a look and you are right. I thought I'd once read something about it being free only for single-server installations or non-commercial use, but I can't find that in the Debian package's copyright. Strange. -- regards, tf@cip.physik.uni-muenchen.de (o_ Thomas Fischbacher - http://www.cip.physik.uni-muenchen.de/~tf //\ (lambda (n) ((lambda (p q r) (p p q r)) (lambda (g x y) V_/_ (if (= x 0) y (g g (- x 1) (* x y)))) n 1)) (Debian GNU) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 16:03 ` Thomas Fischbacher @ 2005-11-18 20:03 ` Gerd Stolpmann 0 siblings, 0 replies; 30+ messages in thread From: Gerd Stolpmann @ 2005-11-18 20:03 UTC (permalink / raw) To: Thomas Fischbacher; +Cc: Florian Weimer, Oliver Bandel, caml-list Am Freitag, den 18.11.2005, 17:03 +0100 schrieb Thomas Fischbacher: > Hm, I just had a look and you are right. I thought I'd once read something > about it being free only for single-server installations or non-commercial > use, but I can't find that in the Debian package's copyright. Strange. The license was changed some time ago. I can remember the initial commercial BDB release (i.e. BDB >= 2.0) was GPL-incompatible. Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Telefon: 06151/153855 Telefax: 06151/997714 ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 15:14 ` Florian Weimer 2005-11-18 16:03 ` Thomas Fischbacher @ 2005-11-18 20:01 ` Gerd Stolpmann 2005-11-18 21:12 ` Florian Weimer 1 sibling, 1 reply; 30+ messages in thread From: Gerd Stolpmann @ 2005-11-18 20:01 UTC (permalink / raw) To: Florian Weimer; +Cc: Thomas Fischbacher, Oliver Bandel, caml-list Am Freitag, den 18.11.2005, 16:14 +0100 schrieb Florian Weimer: > * Thomas Fischbacher: > > >> No, it isn't. Berkeley DB offers much richer functionality, including > >> transactions and replication. > > > > However, crashes at unfortunate moments may lead to database > > corruption, > > All my crashes could be traced back to operator error and faulty > hardware. Berkeley DB has a very bad reputation when it comes to data > integrity and durability, but in my experience, this is simply not > true. I think it is true. A good database must not become corrupt even if the hardware crashes. It must be possible to simply undo the last operation after a crash, and the integrity must not suffer from that. Although BDB promises this stability, I had to throw away a number of databases after hardware crashes. There are simply errors in it, and I cannot recommend it anymore. Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Telefon: 06151/153855 Telefax: 06151/997714 ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 20:01 ` Gerd Stolpmann @ 2005-11-18 21:12 ` Florian Weimer 0 siblings, 0 replies; 30+ messages in thread From: Florian Weimer @ 2005-11-18 21:12 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: Thomas Fischbacher, Oliver Bandel, caml-list * Gerd Stolpmann: >> All my crashes could be traced back to operator error and faulty >> hardware. Berkeley DB has a very bad reputation when it comes to data >> integrity and durability, but in my experience, this is simply not >> true. > > I think it is true. > > A good database must not become corrupt even if the hardware crashes. It depends on the crash. In my case, hardware flipped bits randomly under high load, eventually leading to a kernel crash. This is hard to correct in software. Generally, it's better to fix the broken hardware. I haven't lost data due to simple crashes yet (power outages, crash-early kernel bugs, or MCEs because of thermal issues -- knock on wood). ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 14:37 ` Florian Weimer 2005-11-18 15:05 ` Thomas Fischbacher @ 2005-11-18 16:13 ` Oliver Bandel 1 sibling, 0 replies; 30+ messages in thread From: Oliver Bandel @ 2005-11-18 16:13 UTC (permalink / raw) To: caml-list On Fri, Nov 18, 2005 at 03:37:39PM +0100, Florian Weimer wrote: > * Oliver Bandel: > > >> After a quick search that only returned one dead project (Alpha, v0.0.1, > >> last updated 2002), I have to ask: are there any OCaml bindings for > >> BerkeleyDB? > > > > Isn't Dbm-module that thing? > > No, it isn't. Berkeley DB offers much richer functionality, including > transactions and replication. > > > But there only string -> string can be used. > > So your key as well as value has to be string-type. > > You have to use some serialization/deserialization code. The same > issue arises if you use an SQL database. Maybe rewriting from scratch would make sense. Then I neither need a SQL-database nor writing interface code. If the Dbm is fast enbough and can handle such big amounts of data/files efficiently enough, I would write my own indexer then, using the mentioned idea of inverted lists (but using a think like inverted trees). Ciao, Oliver ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-18 14:54 ` Jonathan Bryant 2005-11-18 14:22 ` Oliver Bandel @ 2005-11-18 14:45 ` Florian Weimer 1 sibling, 0 replies; 30+ messages in thread From: Florian Weimer @ 2005-11-18 14:45 UTC (permalink / raw) To: jtbryant; +Cc: caml-list * Jonathan Bryant: > After a quick search that only returned one dead project (Alpha, v0.0.1, > last updated 2002), I have to ask: are there any OCaml bindings for > BerkeleyDB? Lex Stein maintains a binding: <http://www.eecs.harvard.edu/~stein/> ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <437CD0E5.8080503@yahoo.fr>]
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) [not found] ` <437CD0E5.8080503@yahoo.fr> @ 2005-11-17 20:02 ` Oliver Bandel [not found] ` <437CE8EC.1070109@yahoo.fr> 0 siblings, 1 reply; 30+ messages in thread From: Oliver Bandel @ 2005-11-17 20:02 UTC (permalink / raw) To: caml-list On Thu, Nov 17, 2005 at 07:50:13PM +0100, sejourne_kevin wrote: > Oliver Bandel a écrit : > >Any hints here? > >(Maybe using OCaml, but the imperative features of it would help, > > if the functional features would be too slow?) > > > >Any hint on algorithms/datastructures for this would be fine... > > > > In my work we use "Lucene" (apache.org). Lucene is in java but have been > re-coded in few other langage (c++,python...), the spec of the infex > format is availble for free (online). > > I think the real problem for 10^x (x > 6) docs is the size of the index, > not really the speed of the answer(not for lucene (fast and small)). The size of the index as well as creating time of it. If it's not updateble and must be created new for every update, than this needs to much time. If updates are cheap, then thats the right thing. Ciao, Oliver ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <437CE8EC.1070109@yahoo.fr>]
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) [not found] ` <437CE8EC.1070109@yahoo.fr> @ 2005-11-17 20:41 ` Oliver Bandel 2005-11-18 15:06 ` Florian Hars 0 siblings, 1 reply; 30+ messages in thread From: Oliver Bandel @ 2005-11-17 20:41 UTC (permalink / raw) To: caml-list On Thu, Nov 17, 2005 at 09:32:44PM +0100, sejourne_kevin wrote: > Oliver Bandel a écrit : > >If updates are cheap, then thats the right thing. > About 1Go an Hour, not depend of the number of docs. what is "1Go"? The update/addition of 10^5 entries in 10^6 entries must be ready in about 20 minutes (maximum; faster is better). Creating the first/main index can be hours, that's not the problem, but updates must be fast. Ciao, Oliver ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) 2005-11-17 20:41 ` Oliver Bandel @ 2005-11-18 15:06 ` Florian Hars 0 siblings, 0 replies; 30+ messages in thread From: Florian Hars @ 2005-11-18 15:06 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list Oliver Bandel wrote: > what is "1Go"? One Giga Octet == 1GB (this *is* a french language list, after all :-)). Yours, Florian. ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <437BD5F5.6010307@1969web.com>]
* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) [not found] ` <437BD5F5.6010307@1969web.com> @ 2005-11-17 20:10 ` Oliver Bandel 0 siblings, 0 replies; 30+ messages in thread From: Oliver Bandel @ 2005-11-17 20:10 UTC (permalink / raw) To: caml-list On Wed, Nov 16, 2005 at 04:59:33PM -0800, Karl Zilles wrote: > Oliver Bandel wrote: > >I'm looking for indexing algorithms and especially - if > >such a thing exists - mergeable/extendable indexing algorithms. > > > >So, say we have 10^6 texts that we want ot have an index for, > >to retrieve the text according to some parts of the text > >(keywords, substrings,...). > >We want to make an index from these texts. > > > >After a while we get 10^5 new texts and want to extend > >the exisiting index, so that the whole index not necessarily > >must be created again, with the indexer-tool running on > >all files (^10^6 + 10^5) again, but only have to index the new files, > >but the big index can be extended with additional smaller indizes. > > > >Is there something like that already existing? > >Or must the new index be created on all files again, > >or must there be a workaround with the big and a small index-file, > >where handling of both would be a solution we must provide by ourselfes? > > I wrote a text indexing system a while ago in C++. Pardon me if none of > the following is of interest: [...] > The B* file gets big, but the locations file gets huge. Using this > methodology, you only ever modify the B* file, and never have to > reprocess your indexed documents. Also you can continue to search the > index while documents are being indexed. Wow, that is fine! :) But mmap for example (you mentioned better/newer implementations) can't be used for the whole index, because it definitely will be larger than the available memory. So mmap and such techniques can only be used for parts of the index (but it may be a good choice to use it). The update time was critical, but with indexes like you mentioned, where reading while updating is possible (and when updating is fast), this is very fine. :) Best Regards, and Ciao, Oliver ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2005-11-18 21:12 UTC | newest] Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-11-16 23:42 [1/2 OT] Indexing (and mergeable Index-algorithms) Oliver Bandel 2005-11-17 8:15 ` [Caml-list] " skaller 2005-11-17 15:09 ` Brian Hurt 2005-11-17 17:31 ` skaller 2005-11-17 18:08 ` Brian Hurt 2005-11-17 18:57 ` skaller 2005-11-17 22:15 ` Brian Hurt 2005-11-18 1:49 ` skaller 2005-11-17 8:35 ` Florian Hars 2005-11-17 9:24 ` Oliver Bandel 2005-11-17 12:39 ` Florian Weimer 2005-11-17 20:57 ` Oliver Bandel 2005-11-17 22:02 ` Florian Weimer 2005-11-17 11:49 ` Florian Weimer 2005-11-17 13:55 ` Richard Jones 2005-11-18 14:54 ` Jonathan Bryant 2005-11-18 14:22 ` Oliver Bandel 2005-11-18 14:37 ` Florian Weimer 2005-11-18 15:05 ` Thomas Fischbacher 2005-11-18 15:14 ` Florian Weimer 2005-11-18 16:03 ` Thomas Fischbacher 2005-11-18 20:03 ` Gerd Stolpmann 2005-11-18 20:01 ` Gerd Stolpmann 2005-11-18 21:12 ` Florian Weimer 2005-11-18 16:13 ` Oliver Bandel 2005-11-18 14:45 ` Florian Weimer [not found] ` <437CD0E5.8080503@yahoo.fr> 2005-11-17 20:02 ` Oliver Bandel [not found] ` <437CE8EC.1070109@yahoo.fr> 2005-11-17 20:41 ` Oliver Bandel 2005-11-18 15:06 ` Florian Hars [not found] ` <437BD5F5.6010307@1969web.com> 2005-11-17 20:10 ` Oliver Bandel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox