This is a demonstration of maximum subsequence segmentation with supervised scoring
(trigram and most-recent-unclosed-tag features),
as described in our paper ("Extracting Article Text from the Web with Maximum
Subsequence Segmentation", Pasternack & Roth, WWW 2009).
The training data we used can be found
here.
Please note that this is intended as a
demo; constraints (such as the <HR> constraint
described in our paper) and ad/embed removal are not applied to the output.
If you would like to use MSS in your own research applications, you may download
the library (.NET 2.0+)
here.
You may also call this demo as a web service by sending an HTTP POST request to either
http://took.cs.uiuc.edu/MSS/?offset
or
http://took.cs.uiuc.edu/MSS/?text with the text to process (UTF-8
encoded) as the request's POST data.
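
As a rough illustration of the web service call described above, here is a minimal Python sketch using only the standard library. The two URLs and the UTF-8 POST body come from the description; the helper names `build_mss_request` and `call_mss`, the `Content-Type` header, and the timeout are assumptions made for this example. The response format is not specified here, so the sketch simply returns the response body decoded as text.

```python
import urllib.request

# Base URL of the demo, as given in the description above.
MSS_BASE_URL = "http://took.cs.uiuc.edu/MSS/"


def build_mss_request(page_text, mode="text"):
    """Build (but do not send) a POST request for the MSS demo.

    `mode` selects the endpoint: "text" for ?text or "offset" for ?offset.
    This helper name and its signature are hypothetical.
    """
    if mode not in ("text", "offset"):
        raise ValueError("mode must be 'text' or 'offset'")
    body = page_text.encode("utf-8")  # the service expects UTF-8 post data
    return urllib.request.Request(
        MSS_BASE_URL + "?" + mode,
        data=body,
        method="POST",
        # Assumption: the service does not document a required content type.
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def call_mss(page_text, mode="text", timeout=30):
    """Send the request and return the raw response body as a string."""
    request = build_mss_request(page_text, mode)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the response is UTF-8 text; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")
```

Building the request is kept separate from sending it so that the URL and body can be inspected without contacting the server. A call might look like `result = call_mss(html_source, mode="text")`, where `html_source` is the page to process.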