COPS Research



Database Group - Department of Computer Science - Stanford University

Description

An increasing amount of information is being provided online. In the near future we will likely be able to use digital libraries to find and read journals and books that are available only in hard copy today. However, this new ease of access will also make it much easier to violate intellectual property rights, and many publishers are concerned about providing information online. Knight-Ridder recently stopped providing certain information due to copyright infringement on the net.
Knight-Ridder Reference

COPS is a system that aids in detecting undesired document distribution. Suspect documents are checked against a set of registered documents. The system could be used, for example, as part of a USENET gateway to check for articles that contain text from a registered document.

The basic idea is to take each sentence in a group of registered documents and insert it into a hash table. When we wish to test a new document, we can find any sentences that match a previously registered document in time linear in the size of the new document, rather than in the size of the registered set, which may be quite large. If the new document has more than, say, 5% of its sentences in common with a registered document, then it seems likely that the text was plagiarized, and the document can be flagged for human inspection.
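The following is a minimal sketch of the sentence-hashing idea in Python, not the actual COPS implementation. The names SentenceRegistry, split_sentences, and sentence_key are hypothetical, the punctuation-based sentence splitter and the lowercase/whitespace normalization are simplifying assumptions, and SHA-1 merely stands in for whatever hash function a real system would use.

```python
import hashlib
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_key(sentence):
    """Hash a sentence after lowercasing and collapsing whitespace."""
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SentenceRegistry:
    """Hypothetical registry mapping sentence hashes to registered documents."""

    def __init__(self):
        # sentence hash -> set of ids of registered documents containing it
        self.table = defaultdict(set)

    def register(self, doc_id, text):
        for sentence in split_sentences(text):
            self.table[sentence_key(sentence)].add(doc_id)

    def overlap(self, text):
        """Map each registered doc id to the fraction of the new
        document's sentences that also appear in that document."""
        sentences = split_sentences(text)
        counts = defaultdict(int)
        for sentence in sentences:
            # One hash lookup per sentence: cost grows with the new
            # document, not with the size of the registered set.
            for doc_id in self.table.get(sentence_key(sentence), ()):
                counts[doc_id] += 1
        total = len(sentences)
        return {d: c / total for d, c in counts.items()} if total else {}

    def flag(self, text, threshold=0.05):
        """Return ids of registered documents whose overlap exceeds threshold."""
        return sorted(d for d, f in self.overlap(text).items() if f > threshold)
```

A document would be registered with register("paper", text) and a suspect checked with flag(suspect_text). The 5% default threshold mirrors the figure suggested above and would need tuning in practice.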

The basic idea can be extended to consider groups of sentences, or groups of words, rather than single sentences. The method of hashing English text is also interesting. We have investigated several of these possibilities, and results can be found in the references below.
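As one illustration of these extensions, the sketch below builds overlapping groups of consecutive words or sentences, each of which could be hashed in place of a single sentence. The function names word_chunks and sentence_chunks, the group sizes, and the choice of overlapping windows are assumptions for illustration only; the chunking strategies actually evaluated are described in the paper cited below.

```python
def word_chunks(text, k=3):
    """Overlapping groups of k consecutive words, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def sentence_chunks(sentences, k=2):
    """Overlapping groups of k consecutive sentences from a list."""
    return [" ".join(sentences[i:i + k])
            for i in range(len(sentences) - k + 1)]
```

Larger groups make accidental matches less likely but are easier to defeat with small edits, so the group size is a trade-off between false positives and robustness.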


References

Sergey Brin, James Davis, Hector Garcia-Molina,
Copy Detection Mechanisms for Digital Documents (Postscript),
Proceedings of the ACM SIGMOD Annual Conference, San Francisco, CA, May 1995.


James Davis - jedavis@cs.stanford.edu