COPS is a system that helps detect undesired document distribution. Suspect documents are checked against a set of registered documents. The system could be used, for example, as part of a USENET gateway to check for articles that contain text from a registered document.
The basic idea is to take each sentence in a set of registered documents and insert it into a hash table. When we wish to test a new document, we can then find any sentences that match a previously registered document in time linear in the size of the new document, rather than in the size of the registered set, which may be quite large. If the new document has more than, say, 5% of its sentences in common with a registered document, it seems likely that the text was plagiarized, and the document can be flagged for human inspection.
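The sentence-hashing scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the COPS implementation: the names `SentenceIndex`, `split_sentences`, and `sentence_key` are hypothetical, the sentence splitter is deliberately naive, and SHA-1 over lower-cased, whitespace-normalised text stands in for whatever hashing method the system actually uses.

```python
import hashlib
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_key(sentence):
    """Hash a sentence after normalising case and whitespace (an assumption)."""
    normalised = " ".join(sentence.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


class SentenceIndex:
    """Hash table mapping sentence hashes to the registered documents containing them."""

    def __init__(self):
        self.table = defaultdict(set)  # sentence hash -> set of document ids

    def register(self, doc_id, text):
        """Insert every sentence of a registered document into the table."""
        for sentence in split_sentences(text):
            self.table[sentence_key(sentence)].add(doc_id)

    def overlap(self, text):
        """Map each registered document id to the fraction of the new
        document's sentences that also occur in that document."""
        sentences = split_sentences(text)
        if not sentences:
            return {}
        counts = defaultdict(int)
        # One hash-table lookup per sentence of the new document; the
        # registered set is never scanned.
        for sentence in sentences:
            for doc_id in self.table.get(sentence_key(sentence), ()):
                counts[doc_id] += 1
        return {doc_id: n / len(sentences) for doc_id, n in counts.items()}

    def flag(self, text, threshold=0.05):
        """Return ids of registered documents sharing more than `threshold`
        of the new document's sentences (5% by default, as in the text)."""
        return sorted(d for d, frac in self.overlap(text).items() if frac > threshold)
```

A caller would `register` each protected document once and then call `flag` on every incoming article; any non-empty result marks the article for human inspection.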
This idea can be extended to consider groups of sentences, or groups of words, rather than single sentences. The method used to hash English text is also of interest. We have investigated several of these possibilities, and results can be found in the references below.
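As a rough illustration of the grouping extension, the units to be hashed could be produced by sliding a fixed-size window over the sentences or words of a document. The choice of overlapping windows, the window sizes, and the helper names `windows`, `sentence_groups`, and `word_groups` are all assumptions made for this sketch; the text above does not say how groups are formed.

```python
def windows(items, size):
    """Overlapping runs of `size` consecutive items, each joined into one string.

    Inputs shorter than `size` yield a single group holding everything.
    """
    if not items:
        return []
    if len(items) <= size:
        return [" ".join(items)]
    return [" ".join(items[i:i + size]) for i in range(len(items) - size + 1)]


def sentence_groups(sentences, size=2):
    """Group consecutive sentences; each group is then hashed as one unit."""
    return windows(list(sentences), size)


def word_groups(text, size=5):
    """Group consecutive lower-cased words; each group is then hashed as one unit."""
    return windows(text.lower().split(), size)
```

Each returned string would be hashed and stored in place of a single sentence. Larger groups make accidental matches less likely but are easier to evade with small edits.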