MSS Demo

This is a demonstration of maximum subsequence segmentation with supervised scoring (trigram and most-recent-unclosed-tag features), as described in our paper ("Extracting Article Text from the Web with Maximum Subsequence Segmentation", Pasternack & Roth, WWW 2009).  The training data we used can be found here.

Please note that this is intended as a demo: constraints (such as the <HR> constraint described in our paper) and ad/embed removal are not applied to the output.  If you would like to use MSS in your own research applications, you may download the library (.NET 2.0+) here.

You may also call this demo as a web service by sending an HTTP POST request to either http://took.cs.uiuc.edu/MSS/?offset or http://took.cs.uiuc.edu/MSS/?text, with the text to process (UTF-8 encoded) as the request body.
  1. http://took.cs.uiuc.edu/MSS/?text returns (as the HTTP response's content) the extracted text, including all HTML markup.
  2. http://took.cs.uiuc.edu/MSS/?offset returns the 0-based index, within the submitted text, of the first character of the extracted text, followed by a comma and the length of the extracted text, e.g. "204,334" for an extraction of 334 characters starting at character #204.
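
As a rough illustration of the two calls above, here is a minimal Python sketch using only the standard library. The function names (call_mss, parse_offset_response, slice_extraction) are hypothetical helpers, not part of the MSS library. The sketch assumes the service is still reachable at the address given above, that responses are UTF-8 encoded, and that offsets count characters of the submitted string rather than bytes; none of these is confirmed here.

```python
import urllib.request
from typing import Tuple

# Endpoint as given in the text above; it may no longer be reachable.
MSS_URL = "http://took.cs.uiuc.edu/MSS/"


def call_mss(html: str, mode: str = "text", timeout: float = 30.0) -> str:
    """POST the document (UTF-8 encoded) to ?text or ?offset and return the body."""
    if mode not in ("text", "offset"):
        raise ValueError("mode must be 'text' or 'offset'")
    # urllib sends a default form-encoded Content-Type header for POST data;
    # what the service expects is not documented, so nothing is overridden here.
    request = urllib.request.Request(
        MSS_URL + "?" + mode,
        data=html.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the response body is UTF-8 encoded.
        return response.read().decode("utf-8")


def parse_offset_response(body: str) -> Tuple[int, int]:
    """Parse a '?offset' response such as '204,334' into (start, length)."""
    start_text, length_text = body.strip().split(",")
    return int(start_text), int(length_text)


def slice_extraction(html: str, start: int, length: int) -> str:
    """Cut the extracted span out of the original document.

    Assumption: start and length count characters of the submitted string.
    """
    return html[start:start + length]
```

For the example response "204,334", parse_offset_response returns (204, 334), and slice_extraction(html, 204, 334) is then html[204:538]. Calling call_mss(html, "offset") and slicing locally avoids transferring the extracted text back, at the cost of relying on the character-counting assumption stated above.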

If you intend to make a large number of requests, please space them out (e.g. one second apart) to avoid overwhelming the server.
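
One simple way to space requests out is to sleep between consecutive calls. The sketch below is illustrative only: process_spaced is a hypothetical helper, and send stands for whatever function actually submits one document to the service.

```python
import time
from typing import Callable, Iterable, List


def process_spaced(
    documents: Iterable[str],
    send: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> List[str]:
    """Apply send to each document in order, pausing between consecutive calls."""
    results = []
    for index, document in enumerate(documents):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(send(document))
    return results
```

The default of one second mirrors the spacing suggested above; a longer delay is gentler on the server for very large batches.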

 

Enter the URL of the web page from which to extract the article:


-or- Enter the HTML code of the document directly:
 






Here's the result: