Lighthouse Near Duplicate Clustering Service
Lighthouse offers a near duplicate clustering service that identifies and groups near duplicate documents: documents that are similar but not necessarily identical. Every litigation data set typically contains a large number of very similar documents. These documents contain substantially the same information but would not be caught by a standard de-duplication process.
Examples of documents that are similar but not identical include:
- Word documents used to create PDF versions of the same documents
- The same email string produced by different custodians
- Electronic records and the scanned paper copies printed from them
- Copies of faxes matched to the original document used to send the fax
In a typical collection, 30-40% of the records can be classified as near duplicates (source: Syngence). Assigning each near duplicate cluster to a single reviewer, to be reviewed in one pass, speeds the review process, reduces cost, and lowers litigation risk by ensuring that similar records are treated consistently.
Lighthouse uses SynthetixND software created by industry leader Syngence. The software creates a linguistic profile of the terms used in each document in a collection, then groups documents whose profiles meet a user-controlled similarity threshold. One record in each group is designated as the master, and the other records in the group are compared and scored against it. Records can be allocated to reviewers in batches that contain complete near duplicate groups. Reviewers focus their attention on the master and can either apply a review decision to the entire near duplicate group or make individual document-level decisions.
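For readers who want a concrete picture of the workflow described above, the following is a minimal illustrative sketch in Python. It is not SynthetixND's algorithm, which is proprietary and not described here. It assumes a plain term-frequency count as the "linguistic profile", cosine similarity as the score, and a greedy pass in which the first document of each group becomes the master. The function names (`profile`, `similarity`, `cluster`, `batch_groups`) and the default threshold of 0.8 are hypothetical choices made only for this example.

```python
import math
import re
from collections import Counter


def profile(text):
    """Term-frequency profile: an assumed stand-in for a linguistic profile."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two profiles, from 0.0 to 1.0."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(docs, threshold=0.8):
    """Greedily group documents; the first document in a group is its master.

    docs maps a document ID to its text. Returns a list of groups, each a dict
    with a "master" ID and "members", a list of (doc_id, score) pairs where
    the score is the similarity to the master.
    """
    profiles = {doc_id: profile(text) for doc_id, text in docs.items()}
    groups = []
    for doc_id, prof in profiles.items():
        best_group, best_score = None, threshold
        for group in groups:
            score = similarity(profiles[group["master"]], prof)
            if score >= best_score:
                best_group, best_score = group, score
        if best_group is None:
            # No existing master is similar enough: start a new group.
            groups.append({"master": doc_id, "members": []})
        else:
            best_group["members"].append((doc_id, best_score))
    return groups


def batch_groups(groups, batch_size):
    """Pack whole groups into review batches so that no group is split.

    A group larger than batch_size still goes into a single batch.
    """
    batches, current, current_size = [], [], 0
    for group in groups:
        size = 1 + len(group["members"])
        if current and current_size + size > batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(group)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

In this sketch, raising `threshold` yields smaller, tighter groups and lowering it yields larger, looser ones, which mirrors the user-controlled degree of sameness. Calling `batch_groups(cluster(docs), batch_size)` keeps every group intact within one batch, so a single reviewer sees a master together with all of its near duplicates.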
The clustered data can be loaded into all major review tools.
