* Parsing with two scanners(ources) as input (?)
@ 2010-04-13 19:15 Oliver Bandel
From: Oliver Bandel @ 2010-04-13 19:15 UTC
To: caml-list
Hello,
I want to parse HTML, and by that I mean I want to parse both the
structure of the tags and the contents of the data elements.
At the moment I'm hacking together a special parser for this case,
but it is ugly, because I need to hand-code the state machine
of the parser.
It would be easier and more elegant if I could combine the
HTML tag parsing with the text parsing of the data elements.
For HTML-parsing I use Nethtml.
For Text-Scanning I use Pcre.
I want to be able to select certain tags and text that will occur at
certain positions.
For detecting the found tags I look for Nethtml's
Element (name, args, subnodes)
and for detecting the data-strings I look into Nethtml's
Data string
with Pcre.
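A minimal sketch of that two-case match, under some assumptions: the
document type is redeclared locally with the same shape as Nethtml's, so
the example compiles without ocamlnet, and the Pcre match is replaced by
a hypothetical substring helper called contains. The names hit, scan,
wanted and needle are also made up for illustration.

```ocaml
(* Local stand-in with the same shape as Nethtml.document (assumption:
   the real program would use Nethtml's own type instead). *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical helper: true if [sub] occurs somewhere in [s].
   In the real program a Pcre match would take its place. *)
let contains s sub =
  let n = String.length s and m = String.length sub in
  let rec go i = i + m <= n && (String.sub s i m = sub || go (i + 1)) in
  go 0

(* What the scan reports: a tag of interest or a matching data string. *)
type hit = Tag of string | Text of string

(* Walk the tree in document order. Data strings containing [needle]
   are reported; elements whose name is in [wanted] are reported after
   the hits found in their subnodes. *)
let rec scan ~wanted ~needle docs =
  List.concat
    (List.map
       (function
         | Element (name, _args, subnodes) ->
             let rest = scan ~wanted ~needle subnodes in
             if List.mem name wanted then Tag name :: rest else rest
         | Data s -> if contains s needle then [ Text s ] else [])
       docs)
```

Both kinds of hits end up in one list, so later stages can look at tags
and text together instead of handling them in separate passes.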
I would like to find certain data that occurs after certain
sequences in the tree and then look for certain strings inside those
Data-strings.
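One possible reading of "sequences in the tree" is a path of nested
element names. The sketch below assumes that reading; it may not be the
only one. The document type is again a local stand-in for Nethtml's, and
data_at, select and the predicate argument are hypothetical names. In
the real program the predicate would wrap a Pcre match.

```ocaml
(* Local stand-in with the same shape as Nethtml.document. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Follow [path], a list of element names from the outside in, and
   return the Data strings that are direct children of the last element.
   An empty path returns the Data strings of [docs] itself. *)
let rec data_at path docs =
  match path with
  | [] ->
      List.concat
        (List.map (function Data s -> [ s ] | Element _ -> []) docs)
  | tag :: rest ->
      List.concat
        (List.map
           (function
             | Element (name, _, subnodes) when name = tag ->
                 data_at rest subnodes
             | Element _ | Data _ -> [])
           docs)

(* Keep only the strings at [path] that satisfy [pred]. *)
let select path pred docs = List.filter pred (data_at path docs)
```

For example, select ["table"; "tr"; "td"] pred docs would only test the
text directly inside td cells nested in tr rows of a top-level table.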
Any idea how to create such a parser?
I thought about somehow wrapping the tree and giving it to ocamlyacc.
Maybe menhir is better for that task?
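One way the wrapping could look, as a sketch only: flatten the tree into
a token list and hand the tokens out through a function of type
Lexing.lexbuf -> token, which is the lexer shape that parsers generated
by ocamlyacc expect. The token type here is hypothetical; in practice it
would come from the %token declarations of the grammar. The names
tokens_of and lexer_of_docs are also made up, and the document type is a
local stand-in for Nethtml's.

```ocaml
(* Local stand-in with the same shape as Nethtml.document. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical token type; a real grammar file would declare these
   with %token and the generated parser module would define the type. *)
type token = OPEN of string | CLOSE of string | TEXT of string | EOF

(* Flatten the tree in document order. Attributes are dropped here to
   keep the sketch short. *)
let rec tokens_of docs =
  List.concat
    (List.map
       (function
         | Element (name, _, subnodes) ->
             (OPEN name :: tokens_of subnodes) @ [ CLOSE name ]
         | Data s -> [ TEXT s ])
       docs)

(* Build a lexer-shaped function. It ignores the lexbuf argument and
   returns the next pending token, then EOF once the list is empty. *)
let lexer_of_docs docs =
  let pending = ref (tokens_of docs) in
  fun (_ : Lexing.lexbuf) ->
    match !pending with
    | [] -> EOF
    | t :: rest ->
        pending := rest;
        t
```

A generated entry point, say a hypothetical Parser.main, could then be
called as Parser.main (lexer_of_docs docs) (Lexing.from_string ""),
where the lexbuf is only a dummy. The Pcre matching on TEXT payloads
would happen either before tokenizing, by splitting a Data string into
finer tokens, or afterwards in the grammar's semantic actions.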
At the moment I use the Element-match just to call the recursive
parser on the next doclist.
All my actual parsing happens in the Data-match, which looks at the contents there.
This is because the information I want to parse out of the document
is flat text inside that data-string.
But some of that information could also be found via tag sequences.
So I'm looking for a way to combine both approaches.
How can this be done?
Oliver