* [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
From: 沈胜宇 @ 2013-04-12 14:36 UTC
To: caml-list

Dear all:

I have an int list list, named LL, and I frequently need to decide whether a particular int list, named L, is a sublist of some element of LL.

Is there an efficient data structure for this?

At the moment, I store LL as an (int, bool) Hashtbl.t list, that is, each element of LL is stored as a hash table. Searching for L in LL then reduces to deciding whether there exists an element of LL in which every element of L is present.

Space is not a big problem, but run-time overhead is the major concern. Is there a faster data structure?

Thank you
Shen
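For reference, the representation described above can be sketched roughly as follows. Only the (int, bool) Hashtbl.t list type comes from the message; the function names table_of_list, tables_of_ll and covered are hypothetical, and the sketch ignores element order and multiplicity.

```ocaml
(* Hypothetical helper names; only the (int, bool) Hashtbl.t list
   representation is taken from the message. *)

(* Build one membership table for one element of LL. *)
let table_of_list (l : int list) : (int, bool) Hashtbl.t =
  let h = Hashtbl.create (List.length l) in
  List.iter (fun x -> Hashtbl.replace h x true) l;
  h

(* Build the tables for the whole of LL. *)
let tables_of_ll (ll : int list list) : (int, bool) Hashtbl.t list =
  List.map table_of_list ll

(* [covered tables l] holds when some table contains every element of [l]. *)
let covered (tables : (int, bool) Hashtbl.t list) (l : int list) : bool =
  List.exists (fun h -> List.for_all (fun x -> Hashtbl.mem h x) l) tables
```

Each query walks the tables one by one, so its cost grows with the number of elements of LL times the length of L, which is the overhead the question is about.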
* Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
From: simon cruanes @ 2013-04-12 15:01 UTC
To: caml-list

If the order in the lists does not matter, I would suggest some kind of trie (http://en.wikipedia.org/wiki/Trie) to store the *sorted* int lists; the search algorithm would recursively explore all the branches of the trie that can lead to a superlist of the input list. Here is a code snippet (not thoroughly tested):

(* ------------------- %< ------ >% ----------------- *)
type trie =
  | Node of bool                (* end of a list? *)
          * (int * trie) list   (* subtries, indexed by their first element *)

let empty = Node (false, [])

(* add [l] to [trie], assuming [l] is sorted *)
let rec add trie l = match trie, l with
  | Node (_, subtries), [] -> Node (true, subtries)
  | Node (b, subtries), x::l' ->
    let subtrie =
      try List.assoc x subtries
      with Not_found -> Node (false, [])
    in
    (* recursive add *)
    let subtrie = add subtrie l' in
    (* replace the old binding for [x] with the updated subtrie *)
    let subtries = List.remove_assoc x subtries in
    Node (b, (x,subtrie) :: subtries)

(* find whether [l] (sorted) is a sublist of some list of [trie] *)
let rec find trie l = match trie, l with
  | Node (is_end, subtries), [] ->
    (* all of [l] is matched; any node of a non-empty trie lies on the
       path of some stored list, so only the empty trie fails here *)
    is_end || subtries <> []
  | Node (_, subtries), (x::l') -> find_list x subtries l'
and find_list x subtries l' = match subtries with
  | [] -> false
  | (y,subtrie)::subtries' ->
    (if y < x then find subtrie (x::l')    (* skip [y], keep looking for [x] *)
     else if y = x then find subtrie l'    (* [x] matched, go on with [l'] *)
     else false)                           (* y > x: [x] cannot occur below *)
    || find_list x subtries' l'
(* ------------------- %< ------ >% ----------------- *)

Cheers!
Simon
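A possible variant of the same idea keeps the children of each node in an ordered map, so that the search can stop scanning siblings as soon as a key exceeds the element it is looking for. This is only a sketch under assumptions: it needs OCaml 4.08 or later (for the Int module and Map.to_seq), the names IntMap, add and mem_sub are illustrative, and both insertion and lookup sort and de-duplicate their argument, so lists are treated as sets.

```ocaml
module IntMap = Map.Make (Int)

(* A node records whether a stored list ends here, plus its children
   keyed by the next element. *)
type trie = { is_end : bool; children : trie IntMap.t }

let empty = { is_end = false; children = IntMap.empty }

(* Insert [l]; the list is sorted and de-duplicated first. *)
let add trie l =
  let rec go t = function
    | [] -> { t with is_end = true }
    | x :: rest ->
      let child =
        match IntMap.find_opt x t.children with
        | Some c -> c
        | None -> empty
      in
      { t with children = IntMap.add x (go child rest) t.children }
  in
  go trie (List.sort_uniq compare l)

(* [mem_sub trie l]: is [l], viewed as a set, included in some stored list? *)
let mem_sub trie l =
  let rec go t = function
    | [] -> t.is_end || not (IntMap.is_empty t.children)
    | x :: rest as q ->
      (* Keys come out in increasing order. Stored lists are sorted, so a
         child keyed above [x] can never lead to [x]: stop scanning there. *)
      let rec scan seq =
        match seq () with
        | Seq.Nil -> false
        | Seq.Cons ((y, sub), tl) ->
          if y > x then false
          else if y = x then go sub rest
          else go sub q || scan tl
      in
      scan (IntMap.to_seq t.children)
  in
  go trie (List.sort_uniq compare l)
```

For example, after inserting [1;2;3;4;5;6] and [1;2;4;5], the query [2;4] succeeds and the query [6;7] fails. The worst case is still a walk over a large part of the trie, since branches keyed below the wanted element must all be explored.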
* Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
From: Jean-Francois Monin @ 2013-04-12 15:48 UTC
To: 沈胜宇
Cc: caml-list

You may have some total order on the elements of your lists.
Then consider only sorted lists, and implement LL with tries.

JF

--
Jean-Francois Monin
LIAMA Project FORMES, CNRS & Universite de Grenoble 1 &
Tsinghua University
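To illustrate why sorting helps, here is a small sketch that assumes the standard integer order and treats lists as sets. The names normalize and subset_sorted are hypothetical. Once both lists are sorted, inclusion can be decided in a single pass, and the same order is what lets a trie discard branches early.

```ocaml
(* Put a list into canonical form: ascending order, no duplicates. *)
let normalize (l : int list) : int list = List.sort_uniq compare l

(* [subset_sorted small big]: both arguments must already be normalized.
   One left-to-right pass decides whether every element of [small]
   occurs in [big]. *)
let rec subset_sorted small big =
  match small, big with
  | [], _ -> true
  | _ :: _, [] -> false
  | x :: small', y :: big' ->
    if x = y then subset_sorted small' big'
    else if x > y then subset_sorted small big'
    else false (* x < y: [x] cannot appear later in a sorted [big] *)
```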
* Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
From: 沈胜宇 @ 2013-04-13 6:58 UTC
To: Jean-Francois Monin
Cc: caml-list

Dear Monin:

Thank you for your help.

But I think a trie is too general, in the sense that it does not efficiently handle the case where two lists have multiple (not just one) shared sublists.

For example, I first insert the list a->b->c->d->e->f into the trie, and then I insert a->b->d->e.

The trie cannot store the second shared sublist d->e in the same place; it can only store them like

a->b->c->d->e->f
    ->d->e

Do you have any more suggestions on this?

Shen
* Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
From: Gabriel Scherer @ 2013-04-13 7:56 UTC
To: 沈胜宇
Cc: Jean-Francois Monin, caml-list

There is a fairly generic way to get an efficient data structure if you don't mind huge preprocessing costs. You can see your problem as a word recognition problem (you want to accept only words that are sublists of one of the lists in your set), so a natural data representation of this is a finite-state automaton.

Getting an efficient automaton out of your data set is easy (but may be extremely costly): you only need to implement a determinization algorithm (and, if you want to avoid space explosion, maybe a minimization algorithm as well), and those are well-known. Given an automaton for a list LL, you can add a new list L by creating an automaton recognizing the sublists of L, taking its union with your LL automaton, and determinizing again.

Of course, that is a kind of giant hammer; there are probably more specialized approaches that may be suitable for your problem. I didn't understand whether you're trying to solve a subsequence problem ('ac' is a subsequence of 'abcd') or a substring problem ('ac' is not a substring, while 'abc' would be). For the substring problem, a common trick is to add to your trie not only L, but also the reversed prefixes of L: for the word 'abcd' you would store 'abcd', 'bcd|a', 'cd|ab', 'd|abc'. Checking substring inclusion is then immediate. This multiplies the memory usage; note that DFA minimization can be seen as an optimal, principled way to introduce sharing in this data structure.
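As a rough illustration of the automaton view for the subsequence reading of the problem, the sketch below builds the deterministic automaton of a single list as a table of next occurrences. It deliberately leaves out the union and determinization steps over the whole of LL described above, so a query against LL would still try each per-list automaton in turn. The names automaton, build and accepts are hypothetical, and Hashtbl.find_opt assumes OCaml 4.05 or later.

```ocaml
(* next.(i) maps a symbol c to the smallest index j >= i with word.(j) = c.
   State i means "the first i elements of the word are already consumed". *)
type automaton = (int, int) Hashtbl.t array

let build (l : int list) : automaton =
  let word = Array.of_list l in
  let n = Array.length word in
  let next = Array.init (n + 1) (fun _ -> Hashtbl.create 8) in
  (* Fill right to left: state i inherits state i+1, then records word.(i). *)
  for i = n - 1 downto 0 do
    Hashtbl.iter (fun c j -> Hashtbl.replace next.(i) c j) next.(i + 1);
    Hashtbl.replace next.(i) word.(i) i
  done;
  next

(* Run the automaton on [q]: one table lookup per element of the query. *)
let accepts (next : automaton) (q : int list) : bool =
  let rec go state = function
    | [] -> true
    | c :: rest ->
      (match Hashtbl.find_opt next.(state) c with
       | Some j -> go (j + 1) rest
       | None -> false)
  in
  go 0 q
```

Order and repetitions matter here: with the list [1;2;3;4], the query [1;3] is accepted while [3;1] and [1;1] are rejected. The table costs memory proportional to the list length times the number of distinct elements, which is the kind of preprocessing cost mentioned above.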
* Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
From: Toby Kelsey @ 2013-04-12 22:15 UTC
To: caml-list
Cc: syshen

A data structure useful for finding substrings quickly is the "suffix tree". It can be built in O(n) time (for small alphabets) or O(n log n) time, and a substring search takes O(length of substring) time. The suffix tree takes more space than the original string, though. An int list can take the role of the string here.

Toby
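To make the idea concrete for int lists, here is a naive suffix trie: every suffix of every list is inserted into an ordinary trie. This is not the linear-time suffix tree construction mentioned above; building it takes time and space quadratic in the list length, and only the lookup cost, proportional to the query length, is comparable. The names are hypothetical and the sketch assumes OCaml 4.08 or later for the Int module.

```ocaml
module IntMap = Map.Make (Int)

(* A plain trie; no end-of-list marker is needed for substring queries. *)
type trie = Node of trie IntMap.t

let empty = Node IntMap.empty

(* Insert one list as a path from the root. *)
let rec insert (Node children) = function
  | [] -> Node children
  | x :: rest ->
    let child =
      match IntMap.find_opt x children with
      | Some c -> c
      | None -> empty
    in
    Node (IntMap.add x (insert child rest) children)

(* Insert every suffix of [l]: l, its tail, the tail of its tail, ... *)
let rec add_suffixes trie l =
  match l with
  | [] -> trie
  | _ :: tl -> add_suffixes (insert trie l) tl

(* [q] is a contiguous sublist of some stored list iff it labels a path
   starting at the root. *)
let rec is_substring (Node children) = function
  | [] -> true
  | x :: rest ->
    (match IntMap.find_opt x children with
     | Some c -> is_substring c rest
     | None -> false)
```

With [1;2;3;4;5] stored, [2;3;4] is found but [2;4] is not, because this structure only answers the contiguous (substring) form of the question.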
* Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
From: 沈胜宇 @ 2013-04-13 6:57 UTC
To: Toby Kelsey
Cc: caml-list

Dear Toby:

Thank you for your help.

But my problem is a little different from the substring search problem that suffix trees solve.

In my problem, the notion of a list L1 being a sublist of another list L2 is much more general than in the substring problem.

For example, bcd is a substring of abcde, because bcd occurs contiguously in abcde.

At the same time, bd is not a substring of abcde, because it does not occur contiguously in abcde.

But in my problem, the list b->d is a sublist of a->b->c->d->e.

So after reading the suffix tree introduction on Wikipedia, I think it may not fit my problem.

I also find that a trie is more general than a suffix tree and can be used to handle my problem, but it is too general in the sense that it does not efficiently handle the case where two lists have multiple (not just one) shared sublists.

For example, I first insert the list a->b->c->d->e->f into the trie, and then I insert a->b->d->e.

The trie cannot store the second shared sublist d->e in the same place; it can only store them like

a->b->c->d->e->f
    ->d->e

So do you have any more suggestions on this?

Shen
* Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
From: Goswin von Brederlow @ 2013-04-23 9:05 UTC
To: caml-list

Note: A suffix tree can be built in O(n) time and takes O(n) space. It takes something like 48-64 times the space of the string in OCaml.

It seems like you aren't looking for sublists (in which the order would matter) but subsets (order doesn't matter and elements are unique). You can build a lookup tree containing all subsets of each set like this:

Tree with {a,b,c,d,e} inserted:

+a+b+c+d-e
| | | \e-d
| | +d+c-e
| | | \e-c
| | \e+c-d
| |   \d-c
| +c+b+d-e
| | | \e-d
| | +d+b-e
| | | \e-b
| | \e+b-d
| |   \d-b
| +d+b+c-e
| | | \e-c
| | +c+b-e
| | | \e-b
| | \e+b-c
| |   \c-b
| ...

That gets rather large. If you need to know not only whether L is a subset of one of the sets in LL but also which ones, then each node also needs to store a list of the sets containing the subset expressed so far.

If you can get L sorted, that reduces the tree quite a bit:

+a+b+c+d-e
| | | \e
| | +d-e
| | \e
| +c+d-e
| | \e
| +d-e
| \e
+b+c+d-e
| | \e
| +d-e
| \e
+c+d-e
| \e
+d-e
\e

Since L is sorted, you only need the paths that are sorted. That gives you a tree of size O(2^n), where n is the number of unique ints in all sets. Still huge, but your n might be small enough. This will give you O(|L|) lookup.

As an alternative to sorting L, you could still use the above tree. Start at the root and check the first child: a. Is a in L? If so, go down that branch; otherwise check the next child. With L as a list each lookup would be O(n). As a Set it would be O(log n), and as a Hashtbl.t it would be O(1).

Kind regards,
Goswin
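The sorted variant can be approximated without writing the tree by hand: precompute every sorted subset of every set in LL and store them as keys of a hash table. This is a sketch of the same exponential trade-off, not the tree itself, and it is only usable when the sets are small. The names sorted_subsets, build and is_subset are hypothetical.

```ocaml
(* All sub-lists of a sorted list that keep the sorted order:
   2^n results for a list of length n. *)
let rec sorted_subsets = function
  | [] -> [ [] ]
  | x :: rest ->
    let without = sorted_subsets rest in
    List.map (fun s -> x :: s) without @ without

(* Precompute every subset of every set in LL. *)
let build (ll : int list list) : (int list, unit) Hashtbl.t =
  let index = Hashtbl.create 1024 in
  List.iter
    (fun l ->
       List.iter
         (fun s -> Hashtbl.replace index s ())
         (sorted_subsets (List.sort_uniq compare l)))
    ll;
  index

(* Lookup sorts the query, then does a single hash-table probe. *)
let is_subset index (l : int list) : bool =
  Hashtbl.mem index (List.sort_uniq compare l)
```

One caveat with this shortcut: the default Hashtbl.hash inspects only a bounded part of a structured key, so long lists that share a prefix may collide often. Answers stay correct, but lookups can slow down, and a tree as drawn above avoids that issue.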