* Re: [Caml-list] Substring search on an array of strings [not found] <20040625165512.GG595@speakeasy.org> @ 2004-06-25 17:05 ` Diego Olivier Fernandez Pons 2004-06-25 17:25 ` Shawn Wagner 0 siblings, 1 reply; 5+ messages in thread From: Diego Olivier Fernandez Pons @ 2004-06-25 17:05 UTC (permalink / raw) To: Shawn Wagner; +Cc: caml-list Bonjour, > If you need to find every substring in every element of the array > then, yes, you'll have to look at every element of the array to see > if that substring appears or not in it. No, or at least only in the worst case. In the same way the Knuth-Morris-Pratt suffix automaton allows you not to search for every position in a string if it is not necessary, a specific procedure for searching in an array of strings could avoid you searching on every string. I will try to give some simple examples to get some intuition : - Suppose there are many strings that only contains a few different letters. Then you could build first a map from subsets of the alphabet to string indexes and only search in those strings - Suppose that the strings have a particular pattern (say "processortype.memory.operatingsystem"). Then you could build indexes on every processor type, memory or operating system, and search only in the corresponding strings Diego Olivier ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Caml-list] Substring search on an array of strings 2004-06-25 17:05 ` [Caml-list] Substring search on an array of strings Diego Olivier Fernandez Pons @ 2004-06-25 17:25 ` Shawn Wagner 0 siblings, 0 replies; 5+ messages in thread From: Shawn Wagner @ 2004-06-25 17:25 UTC (permalink / raw) To: caml-list On Fri, Jun 25, 2004 at 07:05:13PM +0200, Diego Olivier Fernandez Pons wrote: > Bonjour, > > > If you need to find every substring in every element of the array > > then, yes, you'll have to look at every element of the array to see > > if that substring appears or not in it. > > No, or at least only in the worst case. > > In the same way the Knuth-Morris-Pratt suffix automaton allows you not > to search for every position in a string if it is not necessary, a > specific procedure for searching in an array of strings could avoid > you searching on every string. > > I will try to give some simple examples to get some intuition : > > - Suppose there are many strings that only contains a few different > letters. Then you could build first a map from subsets of the alphabet > to string indexes and only search in those strings And, of course, to build that map you have to look at every element of the array. Of course, you can then re-use the map, but it has to be generated somehow. > > - Suppose that the strings have a particular pattern (say > "processortype.memory.operatingsystem"). Then you could build indexes > on every processor type, memory or operating system, and search only > in the corresponding strings > You still have to look at the strings to find out where those corresponding strings are. Now, with something like your second example you can sort the array (Once again having to look at everything in the array) and use a binary search to find the range of strings starting with, say, athlon, without having to look at every element in that particular search. Your original questions didn't say anything about such extra information or patterns in the strings you're searching. If you want an answer you like, you have to give us enough data about the problem that we have a hope of coming up with one! Don't say: How do I find every substring in an array of strings. Say: In an array of strings of the format 'processortype.memory.operatingsystem', how can I find all strings that have 'linux' in the operatingsystem field? Aside: Why did you cc'ing caml-list when I only replied to you in that last email? (This one is going to caml-list too) -- Shawn Wagner shawnw@speakeasy.org ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 5+ messages in thread
* [Caml-list] Substring search on an array of strings @ 2004-06-24 7:52 Diego Olivier Fernandez Pons 2004-06-24 15:51 ` Shawn Wagner 0 siblings, 1 reply; 5+ messages in thread From: Diego Olivier Fernandez Pons @ 2004-06-24 7:52 UTC (permalink / raw) To: caml-list Bonjour, Caml Str library does not seem to provide a function allowing to match efficently a string in a array of strings. I was wondering if there were already known algorithms for this before I try the workarounds I had in mind : i) building a 'superstring' with the array of strings (the shortest superstring problem is NP but there are simple primal-dual approximation algorithms) and use a classical search algorithm. ii) try to lift the suffix Knuth-Morris-Prat automaton to work on the union of the words instead of a single word (is this possible ?) Diego Olivier ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Caml-list] Substring search on an array of strings 2004-06-24 7:52 Diego Olivier Fernandez Pons @ 2004-06-24 15:51 ` Shawn Wagner 2004-06-25 16:21 ` Diego Olivier Fernandez Pons 0 siblings, 1 reply; 5+ messages in thread From: Shawn Wagner @ 2004-06-24 15:51 UTC (permalink / raw) To: caml-list On Thu, Jun 24, 2004 at 09:52:37AM +0200, Diego Olivier Fernandez Pons wrote: > Bonjour, > > Caml Str library does not seem to provide a function allowing to match > efficently a string in a array of strings. > > I was wondering if there were already known algorithms for this before > I try the workarounds I had in mind : Do you want to find out if one of the strings in the array equals some other string, or if that other string appears as a substring in one of the array's strings? Either way, annexlib has the needed functions: http://raevnos.pennmush.org/code/annexlib/ -- Shawn Wagner shawnw@speakeasy.org ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Caml-list] Substring search on an array of strings 2004-06-24 15:51 ` Shawn Wagner @ 2004-06-25 16:21 ` Diego Olivier Fernandez Pons 0 siblings, 0 replies; 5+ messages in thread From: Diego Olivier Fernandez Pons @ 2004-06-25 16:21 UTC (permalink / raw) To: Shawn Wagner; +Cc: caml-list Bonjour, > Do you want to find out if one of the strings in the array equals some other > string, or if that other string appears as a substring in one of the array's > strings? Either way, annexlib has the needed functions: > http://raevnos.pennmush.org/code/annexlib/ Find out if a string appears as a substring in an array of strings and return the list of couples (string_index, position) of all matches. After having examined annexlib, it seems that it doesn't have a _specific_ function to do what I want. Of course I could always launch Array.size searches, one on each string. But it is a performance issue of course because the database is really big. Diego Olivier ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2004-06-25 17:22 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20040625165512.GG595@speakeasy.org> 2004-06-25 17:05 ` [Caml-list] Substring search on an array of strings Diego Olivier Fernandez Pons 2004-06-25 17:25 ` Shawn Wagner 2004-06-24 7:52 Diego Olivier Fernandez Pons 2004-06-24 15:51 ` Shawn Wagner 2004-06-25 16:21 ` Diego Olivier Fernandez Pons
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox